Kokoro-82M: Indian English Fine-Tune


🔗 Recipe: kokoro-recipe

A fine-tuned version of Kokoro-82M adapted for Indian-accented English speech synthesis.

This checkpoint targets Indian-English features that the base model struggles with: Indic personal names and city names, the Indian number system (lakh/crore), common Indian-English abbreviations, and the natural rhythm and accent of Indian English speech.


Files

| File | Description |
|---|---|
| kokoro_converted.pth | Fine-tuned model weights (Kokoro inference format) |
| voicepack.pt | Speaker style tensor of shape [510, 1, 256], male Indian-English voice |

What it handles well

  • Indian names: Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar, and similar
  • Indian cities: Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
  • Indian number system: Rs.7,25,000 → "seven lakh twenty five thousand rupees", plus crore expansion
  • Tech acronyms: DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
  • Natural Indian-English prosody: accent and rhythm characteristic of Indian English speakers
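
The lakh/crore grouping above can be made concrete with a small text-normalization sketch. This is only an illustration of the Indian number system, not the normalizer used to train this checkpoint; the function names (`indian_number_to_words`, `expand_rupees`) and the `Rs.` pattern are assumptions introduced here.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def _below_hundred(n: int) -> str:
    """Spell out 1..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def _below_thousand(n: int) -> str:
    """Spell out 1..999."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(_below_hundred(rest))
    return " ".join(parts)


def indian_number_to_words(n: int) -> str:
    """Spell out a non-negative integer using thousand / lakh / crore groups."""
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)   # 1 crore = 1,00,00,000
    lakh, n = divmod(n, 100_000)       # 1 lakh  = 1,00,000
    thousand, n = divmod(n, 1_000)
    parts = []
    if crore:
        parts.append(indian_number_to_words(crore) + " crore")
    if lakh:
        parts.append(_below_hundred(lakh) + " lakh")
    if thousand:
        parts.append(_below_hundred(thousand) + " thousand")
    if n:
        parts.append(_below_thousand(n))
    return " ".join(parts)


# Hypothetical pattern: "Rs." or "Rs" followed by comma-grouped digits.
RUPEE_RE = re.compile(r"Rs\.?\s*(\d+(?:,\d+)*)")


def expand_rupees(text: str) -> str:
    """Replace rupee amounts such as 'Rs.7,25,000' with their spoken form."""
    def repl(match: re.Match) -> str:
        value = int(match.group(1).replace(",", ""))
        return indian_number_to_words(value) + " rupees"
    return RUPEE_RE.sub(repl, text)


print(expand_rupees("Rs.7,25,000"))
```

Pre-expanding amounts like this before synthesis is optional; it mainly helps when you want the spoken form to be explicit and auditable.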

Usage

import torch
import numpy as np
import soundfile as sf
from kokoro import KModel, KPipeline

# Load model and voicepack
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present
model = KModel(
    repo_id="hexgrad/Kokoro-82M",
    config="path/to/config.json",         # Kokoro config.json (from hexgrad/Kokoro-82M)
    model="kokoro_converted.pth"
).to(device).eval()

pipeline = KPipeline(lang_code="b", repo_id="hexgrad/Kokoro-82M", model=model)
voice = torch.load("voicepack.pt", map_location="cpu", weights_only=True)

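# Synthesize
text = "Dr. Narayanan from Chennai will present the quarterly report; total revenue was Rs.3,75,000."
# Call the pipeline directly so raw text goes through G2P;
# generate_from_tokens expects phonemes or pre-tokenized input, not raw text.
gen = pipeline(text, voice=voice, speed=1.0)
audio = np.concatenate([a.cpu().numpy() for _, _, a in gen])
sf.write("output.wav", audio, 24000)  # 24 kHz, matching the training audio spec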

G2P note: For best results, run text through the same G2P pipeline used during training:

  • Language: en-gb (British espeak via misaki)
  • Lexicon: a custom lexicon for Indian names is recommended; use the same lexicon at inference as in training
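
One way to apply a custom lexicon consistently is to look each word up before handing it to the regular G2P. The sketch below is a minimal illustration under stated assumptions: `phonemize_with_lexicon`, the `fallback_g2p` callable, and the lexicon entries are all hypothetical, and the phoneme strings are placeholders rather than the entries used in training.

```python
import re
from typing import Callable, Dict, List, Tuple

# Placeholder entries only; substitute the phoneme strings from your own lexicon.
EXAMPLE_LEXICON: Dict[str, str] = {
    "narayanan": "<phonemes for Narayanan>",
    "chennai": "<phonemes for Chennai>",
}


def phonemize_with_lexicon(
    text: str,
    fallback_g2p: Callable[[str], str],
    lexicon: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (token, phonemes) pairs, preferring lexicon entries over the fallback G2P.

    Words are matched case-insensitively; punctuation tokens pass through unchanged.
    """
    pairs = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        key = token.lower()
        if key in lexicon:
            pairs.append((token, lexicon[key]))        # lexicon override
        elif token[0].isalnum() or token[0] == "_":
            pairs.append((token, fallback_g2p(token)))  # regular G2P (e.g. en-gb)
        else:
            pairs.append((token, token))                # punctuation as-is
    return pairs


# Stand-in fallback for demonstration; a real setup would call the en-gb G2P here.
demo = phonemize_with_lexicon(
    "Dr. Narayanan from Chennai",
    fallback_g2p=lambda word: f"[{word.lower()}]",
    lexicon=EXAMPLE_LEXICON,
)
print(demo)
```

Keeping the lexicon in a single shared file that both the training preprocessing and the inference code load is a simple way to guarantee the two stay in sync.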

Training Details

| Detail | Value |
|---|---|
| Base model | hexgrad/Kokoro-82M |
| Architecture | StyleTTS2 / ISTFTNet |
| Training framework | kokoro-deutsch |
| Dataset | ~1,058 clips, single male Indian-English speaker, ~1.5–2 h total |
| Audio spec | 24 kHz mono WAV |
| Stage 1 | Acoustic warmup (decoder, style_encoder, text_aligner, pitch_extractor) |
| Stage 2 | 20 epochs of prosody fine-tuning (predictor, predictor_encoder) |
| Key hyperparams | lambda_F0: 2.0, batch_size: 4, max_len: 200, lr: 5e-5 |
| G2P | misaki en-gb + custom Indian-English lexicon |
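
The two-stage schedule can be sketched as a per-stage choice of which submodules receive gradient updates. This assumes the modules named in parentheses for each stage are the ones being trained while the rest stay frozen, which the table suggests but does not state outright; `STAGE_MODULES` and `trainable_flags` are hypothetical names, not part of the training framework.

```python
from typing import Dict, Iterable

# Module names taken from the Stage 1 / Stage 2 rows above.
STAGE_MODULES = {
    1: {"decoder", "style_encoder", "text_aligner", "pitch_extractor"},
    2: {"predictor", "predictor_encoder"},
}


def trainable_flags(stage: int, module_names: Iterable[str]) -> Dict[str, bool]:
    """Map each module name to True if it is trained in the given stage."""
    active = STAGE_MODULES[stage]
    return {name: name in active for name in module_names}


# Example: which of these modules would be updated in stage 2?
flags = trainable_flags(2, ["decoder", "predictor", "predictor_encoder"])
print(flags)

# With PyTorch modules held in a dict, the flags could be applied like this:
#     for name, module in modules.items():
#         module.requires_grad_(flags[name])
```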

Credits


License

Apache 2.0, the same license as the base Kokoro-82M model.

