metadata
language:
- en
tags:
- text-to-speech
- tts
- kokoro
- styletts2
- indian-english
- fine-tuned
license: apache-2.0
base_model: hexgrad/Kokoro-82M
Kokoro-82M: Indian English Fine-Tune
Recipe: kokoro-recipe
A fine-tuned version of Kokoro-82M adapted for Indian-accented English speech synthesis.
This checkpoint handles Indian-English particulars that the base model struggles with: Indic names and city names, the Indian number system (lakh/crore), common Indian-English abbreviations, and the natural rhythm and accent of Indian English speech.
Files
| File | Description |
|---|---|
| kokoro_converted.pth | Fine-tuned model weights (Kokoro inference format) |
| voicepack.pt | Speaker style tensor [510, 1, 256], male Indian-English voice |
What it handles well
- Indian names: Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar, and similar
- Indian cities: Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
- Indian number system: Rs.7,25,000 → "seven lakh twenty five thousand rupees", crore expansion
- Tech acronyms: DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
- Natural Indian-English prosody: accent and rhythm characteristic of Indian English speakers
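The lakh/crore grouping in the list above can be illustrated with a small text normalizer. This is a minimal sketch, not the normalizer used for training: the function names (`indian_number_to_words`, `expand_rupees`) and the `Rs.` pattern are assumptions made for illustration, and it only covers non-negative whole rupee amounts.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99 without hyphens ("twenty five")."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def three_digits(n: int) -> str:
    """Spell out 0..999."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def indian_number_to_words(n: int) -> str:
    """Spell out a non-negative integer using crore / lakh / thousand groups."""
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)   # 1 crore = 1,00,00,000
    lakh, n = divmod(n, 100_000)       # 1 lakh  = 1,00,000
    thousand, rest = divmod(n, 1_000)
    parts = []
    if crore:
        parts.append(indian_number_to_words(crore) + " crore")
    if lakh:
        parts.append(two_digits(lakh) + " lakh")
    if thousand:
        parts.append(two_digits(thousand) + " thousand")
    if rest:
        parts.append(three_digits(rest))
    return " ".join(parts)


# Hypothetical pattern: "Rs." followed by comma-grouped digits.
RUPEE_RE = re.compile(r"Rs\.?\s*(\d+(?:,\d+)*)")


def expand_rupees(text: str) -> str:
    """Replace amounts such as "Rs.7,25,000" with their spoken form."""
    def repl(match: re.Match) -> str:
        amount = int(match.group(1).replace(",", ""))
        return indian_number_to_words(amount) + " rupees"
    return RUPEE_RE.sub(repl, text)


print(expand_rupees("Rs.7,25,000"))
```

Running a pass like this before synthesis keeps the spoken form explicit. Whether it is needed depends on how much normalization the G2P front end already performs.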
Usage
import numpy as np
import soundfile as sf
import torch
from kokoro import KModel, KPipeline

# Load the fine-tuned weights and the voicepack
device = "cuda"  # or "cpu"
model = KModel(
    repo_id="hexgrad/Kokoro-82M",
    config="path/to/config.json",  # Kokoro config.json (from hexgrad/Kokoro-82M)
    model="kokoro_converted.pth",
).to(device).eval()
pipeline = KPipeline(lang_code="b", repo_id="hexgrad/Kokoro-82M", model=model)
voice = torch.load("voicepack.pt", map_location="cpu", weights_only=True)

# Synthesize. Calling the pipeline runs G2P on raw text; generate_from_tokens
# expects a phoneme string or pre-tokenized input rather than plain text.
text = "Dr. Narayanan from Chennai will present the quarterly report; total revenue was Rs.3,75,000."
gen = pipeline(text, voice=voice, speed=1.0)
audio = np.concatenate([a for _, _, a in gen])  # each result unpacks to (graphemes, phonemes, audio)
sf.write("output.wav", audio, 24000)  # 24 kHz mono output
G2P note: For best results, run text through the same G2P pipeline used during training:
- Language: en-gb (British espeak via misaki)
- Custom lexicon for Indian names recommended (use the same lexicon at inference as at training)
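One way to apply a custom lexicon ahead of a general G2P engine is a lookup-then-fallback pass, sketched below. Everything here is an assumption for illustration: `CUSTOM_LEXICON`, its phoneme strings, `phonemize_with_lexicon`, and the `fallback` callable are hypothetical stand-ins. The real lexicon entries and the misaki call used at training time are not reproduced here.

```python
import re
from typing import Callable, Dict

# Illustrative entries only; the phoneme strings are NOT taken from the
# training lexicon and should be replaced with the ones actually used.
CUSTOM_LEXICON: Dict[str, str] = {
    "chennai": "ˈtʃɛnaɪ",
    "bengaluru": "ˈbɛŋɡəlʊruː",
}


def phonemize_with_lexicon(
    text: str,
    lexicon: Dict[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Phonemize each word: lexicon entry if present, otherwise fallback G2P.

    Punctuation and whitespace between words are passed through unchanged.
    """
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return lexicon.get(word.lower()) or fallback(word)

    return re.sub(r"[A-Za-z]+", repl, text)


# Stand-in fallback that only tags the word; in practice this would call
# the same en-gb G2P engine that was used during training.
def demo_fallback(word: str) -> str:
    return f"<{word}>"


print(phonemize_with_lexicon("Chennai calling", CUSTOM_LEXICON, demo_fallback))
```

Keeping the lexicon in one dictionary makes it straightforward to load the same entries at training and at inference, which is the consistency the note above asks for.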
Training Details
| Detail | Value |
|---|---|
| Base model | hexgrad/Kokoro-82M |
| Architecture | StyleTTS2 / ISTFTNet |
| Training framework | kokoro-deutsch |
| Dataset | ~1,058 clips, single male Indian-English speaker, ~1.5 to 2 h total |
| Audio spec | 24 kHz mono WAV |
| Stage 1 | Acoustic warmup (decoder, style_encoder, text_aligner, pitch_extractor) |
| Stage 2 | 20 epochs prosody fine-tuning (predictor, predictor_encoder) |
| Key hyperparams | lambda_F0: 2.0, batch_size: 4, max_len: 200, lr: 5e-5 |
| G2P | misaki en-gb + custom Indian-English lexicon |
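For a rough sense of scale, the figures in the table imply the following step counts and clip lengths. This is back-of-the-envelope arithmetic only. It assumes the clip count is exactly 1,058, that the final partial batch is kept, and that the 20 epochs refer to Stage 2 alone. None of these are confirmed by the table.

```python
import math

clips = 1058        # approximate clip count from the table
batch_size = 4
epochs = 20         # Stage 2 epochs

# Optimizer steps, assuming the last partial batch is not dropped
steps_per_epoch = math.ceil(clips / batch_size)
total_steps = steps_per_epoch * epochs

# Average clip length for the lower and upper dataset-size estimates
avg_clip_seconds = {hours: hours * 3600 / clips for hours in (1.5, 2.0)}

print(f"steps per epoch: {steps_per_epoch}")
print(f"total Stage 2 steps: {total_steps}")
for hours, seconds in avg_clip_seconds.items():
    print(f"{hours} h total -> about {seconds:.1f} s per clip")
```

Under these assumptions Stage 2 runs for roughly 5,300 steps on clips averaging about 5 to 7 seconds.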
Credits
- hexgrad/Kokoro-82M: Base TTS model (Apache 2.0)
- yl4579/StyleTTS2: Underlying architecture (MIT)
- semidark/kokoro-deutsch: Training framework (Apache 2.0)
- hexgrad/misaki: G2P engine (Apache 2.0)
License
Apache 2.0, same as the base Kokoro-82M model.