language:
  - en
tags:
  - text-to-speech
  - tts
  - kokoro
  - styletts2
  - indian-english
  - fine-tuned
license: apache-2.0
base_model: hexgrad/Kokoro-82M

Kokoro-82M — Indian English Fine-Tune

🔗 Recipe: kokoro-recipe

A fine-tuned version of Kokoro-82M adapted for Indian-accented English speech synthesis.

This checkpoint handles Indian-English particulars that the base model struggles with — Indic names and city names, the Indian number system (lakh/crore), common Indian-English abbreviations, and the natural rhythm and accent of Indian English speech.

Files

| File | Description |
|------|-------------|
| `kokoro_converted.pth` | Fine-tuned model weights (Kokoro inference format) |
| `voicepack.pt` | Speaker style tensor `[510, 1, 256]`, male Indian-English voice |

What it handles well

Indian names — Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar, and similar
Indian cities — Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
Indian number system — Rs.7,25,000 → "seven lakh twenty five thousand rupees", crore expansion
Tech acronyms — DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
Natural Indian-English prosody — accent and rhythm characteristic of Indian English speakers
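
The lakh/crore reading in the rupee example can be illustrated with a small text-normalization sketch. This is not taken from the training recipe: the card does not say whether such expansion happens inside the model or in preprocessing, and the names `indian_number_words` and `expand_rupees` are hypothetical helpers introduced here only to show how Indian digit grouping (crore, lakh, thousand) maps to words.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def _below_hundred(n: int) -> str:
    """Words for 0 <= n < 100."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def _below_thousand(n: int) -> str:
    """Words for 1 <= n < 1000."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        parts.append(_below_hundred(rest))
    return " ".join(parts)


def indian_number_words(n: int) -> str:
    """Spell out a non-negative integer using crore / lakh / thousand grouping."""
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)   # 1 crore = 1,00,00,000
    lakh, n = divmod(n, 100_000)       # 1 lakh  = 1,00,000
    thousand, n = divmod(n, 1_000)
    parts = []
    if crore:
        parts.append(f"{indian_number_words(crore)} crore")
    if lakh:
        parts.append(f"{_below_hundred(lakh)} lakh")
    if thousand:
        parts.append(f"{_below_hundred(thousand)} thousand")
    if n:
        parts.append(_below_thousand(n))
    return " ".join(parts)


# Matches "Rs.7,25,000" or "Rs 7,25,000"; a trailing comma or period is left alone.
RUPEE_PATTERN = re.compile(r"Rs\.?\s*(\d+(?:,\d+)*)")


def expand_rupees(text: str) -> str:
    """Replace rupee amounts in text with their spoken Indian-English form."""
    def _replace(match: re.Match) -> str:
        amount = int(match.group(1).replace(",", ""))
        return f"{indian_number_words(amount)} rupees"
    return RUPEE_PATTERN.sub(_replace, text)


print(expand_rupees("Rs.7,25,000"))
```

A normalizer like this could be run on input text before synthesis when explicit control over number readings is wanted. Decimals, paise, and negative amounts are not handled.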

Usage

import torch
import numpy as np
import soundfile as sf
from kokoro import KModel, KPipeline

# Load model and voicepack
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model = KModel(
    repo_id="hexgrad/Kokoro-82M",
    config="path/to/config.json",         # Kokoro config.json (from hexgrad/Kokoro-82M)
    model="kokoro_converted.pth"
).to(device).eval()

pipeline = KPipeline(lang_code="b", repo_id="hexgrad/Kokoro-82M", model=model)
voice = torch.load("voicepack.pt", map_location="cpu", weights_only=True)

# Synthesize
text = "Dr. Narayanan from Chennai will present the quarterly report — total revenue was Rs.3,75,000."
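# Calling the pipeline runs G2P on raw text; generate_from_tokens expects phonemes or tokens instead.
gen = pipeline(text, voice=voice, speed=1.0)
# Each result unpacks to (graphemes, phonemes, audio); convert audio to NumPy before joining.
audio = np.concatenate([np.asarray(a) for _, _, a in gen])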
sf.write("output.wav", audio, 24000)

G2P note: For best results, run text through the same G2P pipeline used during training:

Language: en-gb (British espeak via misaki)
Custom lexicon for Indian names recommended (use the same lexicon at inference as at training)
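
One way to apply the same lexicon at inference as at training is to consult it before handing a word to the general G2P. The sketch below is a generic wrapper, not the mechanism used by misaki or by the training recipe. `phonemize_with_lexicon` is a hypothetical helper, `fallback_g2p` stands for whatever word-level G2P callable is available, and the lexicon values are placeholders rather than real phoneme strings.

```python
import re
from typing import Callable, Dict

# Hypothetical entries: the values are placeholders, NOT the phonemes used in training.
CUSTOM_LEXICON: Dict[str, str] = {
    "chennai": "<phonemes-for-Chennai>",
    "narayanan": "<phonemes-for-Narayanan>",
}

# Split text into runs of letters and runs of everything else (spaces, digits, punctuation).
CHUNK_RE = re.compile(r"[^\W\d_]+|[\W\d_]+")


def phonemize_with_lexicon(
    text: str,
    fallback_g2p: Callable[[str], str],
    lexicon: Dict[str, str] = CUSTOM_LEXICON,
) -> str:
    """Phonemize text word by word, preferring lexicon entries over the fallback G2P."""
    out = []
    for chunk in CHUNK_RE.findall(text):
        key = chunk.lower()
        if key in lexicon:
            out.append(lexicon[key])          # exact, case-insensitive lexicon hit
        elif chunk.isalpha():
            out.append(fallback_g2p(chunk))   # ordinary word: defer to the general G2P
        else:
            out.append(chunk)                 # whitespace, digits, punctuation pass through
    return "".join(out)


# Usage with a stand-in fallback that just upper-cases the word.
print(phonemize_with_lexicon("Visit Chennai now.", str.upper))
```

Word-by-word fallback discards sentence context, so a real setup would more likely register the custom entries with the G2P engine itself. The point here is only the lookup order: lexicon first, general G2P second.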

Training Details

| Detail | Value |
|--------|-------|
| Base model | hexgrad/Kokoro-82M |
| Architecture | StyleTTS2 / ISTFTNet |
| Training framework | kokoro-deutsch |
| Dataset | ~1,058 clips, single male Indian-English speaker, ~1.5–2 h total |
| Audio spec | 24 kHz mono WAV |
| Stage 1 | Acoustic warmup (decoder, style_encoder, text_aligner, pitch_extractor) |
| Stage 2 | 20 epochs prosody fine-tuning (predictor, predictor_encoder) |
| Key hyperparams | `lambda_F0: 2.0`, `batch_size: 4`, `max_len: 200`, `lr: 5e-5` |
| G2P | misaki `en-gb` + custom Indian-English lexicon |

Credits

hexgrad/Kokoro-82M — Base TTS model (Apache 2.0)
yl4579/StyleTTS2 — Underlying architecture (MIT)
semidark/kokoro-deutsch — Training framework (Apache 2.0)
hexgrad/misaki — G2P engine (Apache 2.0)

License

Apache 2.0 — same as the base Kokoro-82M model.