cxl
Published on 2025-05-30

Hands-On with CosyVoice2

In the field of AI speech synthesis, CosyVoice2 stands out for its excellent performance and rich feature set. Whether adding lifelike narration to audiobooks or building natural, fluent interactions for intelligent customer service, CosyVoice2 is up to the task. Compared with Spark-TTS and Index-TTS, it offers a bit more to play with. It is now live on my personal site: you can try it via the CosyVoice-TTS entry in the site menu, and leave a comment if you run into problems.

What Is CosyVoice2

CosyVoice2 is an advanced streaming speech synthesis model developed by Alibaba Group. It marks a major step forward in TTS technology: it generates highly natural speech and supports multiple languages and speakers, enabling customized services for users worldwide. Its core strengths are:

  1. Near-human naturalness: built on recent large language models (LLMs) and streaming processing, the generated speech is fluent and natural, nearly indistinguishable from a human voice, greatly improving the experience across voice interaction scenarios.

  2. Efficient speech coding: finite scalar quantization (FSQ) and a pretrained LLM backbone capture and reproduce fine details of the speech signal, producing clear, natural, and expressive output.

  3. Multi-scenario support: whether for virtual assistants, online customer service, or voice chat apps, CosyVoice2 delivers smooth interactions thanks to its low latency and high-quality synthesis.

Advantages of CosyVoice2

  1. Accurate pronunciation and clear audio: compared with earlier versions, CosyVoice2 reduces pronunciation errors by 30%-50%, with every word articulated crisply, approaching professional broadcast quality. Its audio quality score also rises from 5.4 to 5.53, sounding more natural and comfortable.

  2. Ultra-low latency: at just 150 ms, real-time voice interaction and online speech translation run smoothly, with no more stuttering.

  3. Dialect, accent, and emotion control: fine-grained dialect and accent adjustment can mimic authentic Cantonese, Sichuanese, and more. It can also follow instructions to express emotions such as joy, sadness, and excitement, making the speech more vivid.

  4. Multilingual support: focused on natural speech generation, it covers five languages (Chinese, English, Japanese, Cantonese, and Korean), easily meeting users' different language needs.

  5. Zero-shot voice generation and cross-lingual synthesis: with only a 3-10 second audio sample, it can closely imitate a user's voice, reproducing even prosody and emotion, and it can synthesize across languages, making it a genuine "voice changer".

  6. Rich-text and natural-language control: emotion and prosody can be controlled via rich text or natural-language instructions, letting users tailor more expressive speech.

Environment Setup Walkthrough

1. Environment used in this tutorial

  • OS: Ubuntu 22.04 LTS

  • Python: 3.10

  • CUDA: 12.9

  • GPU memory: 6 GB

2. Installation steps

Step 1: Clone the repository

  git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
  cd CosyVoice
  git submodule update --init --recursive  # make sure the submodules are cloned successfully

Step 2: Create a virtual environment

This guide uses conda as the environment manager.

  conda create -n cosyvoice python=3.10
  conda activate cosyvoice

Step 3: Install dependencies

  conda install -y -c conda-forge pynini==2.1.5
  pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/
  # if you hit sox compatibility issues or errors, install the following packages
  sudo apt-get install sox libsox-dev
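Before moving on, a quick sanity check can confirm that the key packages are importable. This is an optional sketch (the package list is my assumption based on the steps above, so adjust it to your setup); it uses only the standard library, so the check itself cannot fail on a missing dependency:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that cannot be found on this system."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages the CosyVoice scripts below rely on.
required = ["torch", "torchaudio", "modelscope"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages found.")
```

If anything is reported missing, rerun the pip install step above inside the activated conda environment.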

Step 4: Download the models

  # Method 1: download via modelscope
  # run the commands below in a Python shell, or save them to a .py file and run it with python
  # SDK model download
  from modelscope import snapshot_download
  snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
  snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
  snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
  snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
  snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')

  # Method 2: download via git
  # make sure git lfs is installed before cloning the models
  mkdir -p pretrained_models
  git clone https://www.modelscope.cn/iic/CosyVoice2-0.5B.git pretrained_models/CosyVoice2-0.5B
  git clone https://www.modelscope.cn/iic/CosyVoice-300M.git pretrained_models/CosyVoice-300M
  git clone https://www.modelscope.cn/iic/CosyVoice-300M-SFT.git pretrained_models/CosyVoice-300M-SFT
  git clone https://www.modelscope.cn/iic/CosyVoice-300M-Instruct.git pretrained_models/CosyVoice-300M-Instruct
  git clone https://www.modelscope.cn/iic/CosyVoice-ttsfrd.git pretrained_models/CosyVoice-ttsfrd
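Whichever method you use, it is worth verifying that each model directory actually landed under pretrained_models/ before running the test scripts. A minimal stdlib sketch (the directory names are taken from the download commands above):

```python
from pathlib import Path

MODEL_DIRS = [
    "CosyVoice2-0.5B",
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "CosyVoice-ttsfrd",
]

def missing_model_dirs(base="pretrained_models", names=MODEL_DIRS):
    """Return the model directories that are absent or empty under `base`."""
    root = Path(base)
    return [n for n in names if not any((root / n).glob("*"))]

if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing or empty:", ", ".join(missing))
    else:
        print("All model directories present.")
```

An empty directory usually means a git clone without git lfs, in which case the weight files were never fetched.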

3. Running

Test scripts

Script 1

Save the following as test_CosyVoice2-0.5B.py, then run it with python test_CosyVoice2-0.5B.py; it generates three audio files.

  import sys
  sys.path.append('third_party/Matcha-TTS')
  from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
  from cosyvoice.utils.file_utils import load_wav
  import torchaudio

  # use the CosyVoice2-0.5B model
  cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)

  # NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
  # zero_shot usage
  prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
  for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
      torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  
  # save zero_shot spk for future usage
  assert cosyvoice.add_zero_shot_spk('希望你以后能够做的比我还好呦。', prompt_speech_16k, 'my_zero_shot_spk') is True
  for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '', '', zero_shot_spk_id='my_zero_shot_spk', stream=False)):
      torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  cosyvoice.save_spkinfo()
  
  # fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L248
  for i, j in enumerate(cosyvoice.inference_cross_lingual('在他讲述那个荒诞故事的过程中,他突然[laughter]停下来,因为他自己也被逗笑了[laughter]。', prompt_speech_16k, stream=False)):
      torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  
  # instruct usage
  for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '用四川话说这句话', prompt_speech_16k, stream=False)):
      torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  
  # bistream usage, you can use generator as input, this is useful when using text llm model as input
  # NOTE you should still have some basic sentence split logic because llm can not handle arbitrary sentence length
  def text_generator():
      yield '收到好友从远方寄来的生日礼物,'
      yield '那份意外的惊喜与深深的祝福'
      yield '让我心中充满了甜蜜的快乐,'
      yield '笑容如花儿般绽放。'
  for i, j in enumerate(cosyvoice.inference_zero_shot(text_generator(), '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
      torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
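The bistream example at the end of the script notes that you still need basic sentence splitting when streaming LLM output into the model. A stdlib sketch of such a splitter (the punctuation set here is my assumption; adjust it for your text):

```python
import re

def split_sentences(text):
    """Yield chunks of `text`, splitting after common sentence punctuation."""
    # The lookbehind keeps the punctuation attached to the preceding chunk.
    for chunk in re.split(r"(?<=[。!?!?,,])", text):
        if chunk.strip():
            yield chunk

# Each yielded chunk can be fed to inference_zero_shot via a generator,
# in place of the hand-written text_generator above.
print(list(split_sentences("收到好友从远方寄来的生日礼物,那份意外的惊喜让我开心。")))
```

In practice you would wrap the token stream coming out of your LLM with a buffer and flush a chunk each time one of these delimiters appears.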

Script 2

Save the following as test_CosyVoice-300M-SFT.py, then run it with python test_CosyVoice-300M-SFT.py; it generates five audio files.

  import sys
  sys.path.append('third_party/Matcha-TTS')
  from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
  from cosyvoice.utils.file_utils import load_wav
  import torchaudio

  cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-SFT', load_jit=False, load_trt=False, fp16=False)
  # sft usage
  print(cosyvoice.list_available_spks())
  # change stream=True for chunk stream inference
  for i, j in enumerate(cosyvoice.inference_sft('你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?', '中文女', stream=False)):
      torchaudio.save('sft_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  
  cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M')
  # zero_shot usage, <|zh|><|en|><|jp|><|yue|><|ko|> for Chinese/English/Japanese/Cantonese/Korean
  prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
  for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):
      torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  # cross_lingual usage
  prompt_speech_16k = load_wav('./asset/cross_lingual_prompt.wav', 16000)
  for i, j in enumerate(cosyvoice.inference_cross_lingual('<|en|>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that\'s coming into the family is a reason why sometimes we don\'t buy the whole thing.', prompt_speech_16k, stream=False)):
      torchaudio.save('cross_lingual_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  # vc usage
  prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
  source_speech_16k = load_wav('./asset/cross_lingual_prompt.wav', 16000)
  for i, j in enumerate(cosyvoice.inference_vc(source_speech_16k, prompt_speech_16k, stream=False)):
      torchaudio.save('vc_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
  
  cosyvoice = CosyVoice('pretrained_models/CosyVoice-300M-Instruct')
  # instruct usage, support <laughter></laughter><strong></strong>[laughter][breath]
  for i, j in enumerate(cosyvoice.inference_instruct('在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。', '中文男', 'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.', stream=False)):
      torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

Web page (GUI)

  # Using the CosyVoice-300M-Instruct model: in my tests it loads the pretrained speaker list correctly, while the model below does not show one
  python webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
  # Using the CosyVoice2-0.5B model
  python webui.py --port 50000 --model_dir pretrained_models/CosyVoice2-0.5B

Bonus

CosyVoice supports expressive markers (I recall Spark-TTS has something similar): you add tone and expression tags to the input text, and they are reflected in the synthesized speech. For example:

  1. 他搞的一个恶作剧,让大家<laughter>忍俊不禁</laughter>。

  2. 用开心的语气说<|endofprompt|>参加朋友的婚礼,看着新人幸福的笑脸,我感到无比开心。这样的爱与承诺,总是令人心生向往。

  3. 追求卓越不是终点,它需要你每天都<strong>付出</strong>和<strong>精进</strong>,最终才能达到巅峰。

  4. 当你用心去倾听一首音乐时[breath],你会开始注意到那些细微的音符变化[breath],并通过它们感受到音乐背后的情感。

Other supported markers can be found in the project source (cosyvoice/tokenizer/tokenizer.py, referenced in script 1); if you are interested, try them one by one.
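To keep track of which markers a piece of input text actually uses, a small helper can scan for them. A stdlib sketch (the marker list below covers only the examples above, not the model's full set):

```python
import re

# Markers shown in the examples above; the full set lives in
# cosyvoice/tokenizer/tokenizer.py.
MARKER_PATTERN = re.compile(
    r"</?laughter>|</?strong>|\[laughter\]|\[breath\]|<\|endofprompt\|>"
)

def find_markers(text):
    """Return every expressive marker occurring in `text`, in order."""
    return MARKER_PATTERN.findall(text)

print(find_markers("他搞的一个恶作剧,让大家<laughter>忍俊不禁</laughter>。"))
```

This is handy for sanity-checking prompts before synthesis, e.g. confirming that paired tags like <laughter>...</laughter> are balanced.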

