CosyVoice2-0.5B — Akan Fine-tune

Fine-tuned version of CosyVoice2-0.5B for Akan (Twi / Fante) text-to-speech synthesis.

Trained on the aka_asr subset of google/WaxalNLP with 101 speakers and ~10,000 utterances.

Usage

from huggingface_hub import snapshot_download
from cosyvoice.cli.cosyvoice import CosyVoice2
import torchaudio

model_dir = snapshot_download('prince4332/CosyVoice2-0.5B-Akan')
model = CosyVoice2(model_dir, load_jit=False, load_trt=False, fp16=True)

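# Zero-shot voice cloning
prompt_wav, sr = torchaudio.load('reference_akan.wav')
prompt_wav = prompt_wav.mean(dim=0, keepdim=True)   # downmix to mono
if sr != 16000:
    # the prompt must be 16 kHz
    prompt_wav = torchaudio.functional.resample(prompt_wav, sr, 16000)

# The generator may yield more than one segment for longer text,
# so each segment is written to its own file.
for i, chunk in enumerate(model.inference_zero_shot(
    tts_text='Meda wo ase.',
    prompt_text='Akwaaba!',            # transcript of reference_akan.wav
    prompt_speech_16k=prompt_wav,
    stream=False,
)):
    # model.sample_rate is the output rate (24 kHz for CosyVoice2)
    torchaudio.save(f'output_{i}.wav', chunk['tts_speech'], model.sample_rate)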

Training Details

  • Base model: FunAudioLLM/CosyVoice2-0.5B
  • Fine-tuned modules: LLM (language model head)
  • Frozen modules: Flow, HiFi-GAN vocoder, speech tokeniser, speaker encoder
  • Dataset: google/WaxalNLP, aka_asr subset
  • Hardware: A100 / T4 GPU
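The split between fine-tuned and frozen modules listed above can be illustrated with a generic PyTorch pattern. This is only a sketch under stated assumptions: the card does not include the training script, `TinyTTS` is a hypothetical stand-in rather than the real CosyVoice2 classes, and the attribute names `llm`, `flow`, and `hift` as well as the learning rate are chosen purely for illustration.

```python
import torch
from torch import nn


class TinyTTS(nn.Module):
    """Hypothetical stand-in model; NOT the real CosyVoice2 architecture."""

    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(8, 8)    # stands in for the language model
        self.flow = nn.Linear(8, 8)   # stands in for the flow module
        self.hift = nn.Linear(8, 8)   # stands in for the vocoder


def freeze_all_but(model, trainable_prefixes):
    """Enable gradients only for parameters whose name starts with a given prefix.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


model = TinyTTS()
trainable_names = freeze_all_but(model, ("llm.",))

# Only the unfrozen parameters are handed to the optimizer.
# The learning rate here is an illustrative value, not the one used for this model.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```

Frozen parameters receive no gradient updates, so the acoustic modules keep the base model's weights while only the language model adapts to Akan.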