---
language:
- en
tags:
- text-to-speech
- tts
- kokoro
- styletts2
- indian-english
- fine-tuned
license: apache-2.0
base_model: hexgrad/Kokoro-82M
---

# Kokoro-82M: Indian English Fine-Tune


🔗 **Recipe:** [kokoro-recipe](https://github.com/Jeevav62/tts-finetune-recipes/tree/main/kokoro-recipe)

A fine-tuned version of [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M) adapted for **Indian-accented English** speech synthesis. This checkpoint targets Indian-English particulars that the base model struggles with: Indic personal names and city names, the Indian number system (lakh/crore), common Indian-English abbreviations, and the natural rhythm and accent of Indian English speech.

---

## Files

| File | Description |
|---|---|
| `kokoro_converted.pth` | Fine-tuned model weights (Kokoro inference format) |
| `voicepack.pt` | Speaker style tensor of shape `[510, 1, 256]`, male Indian-English voice |

---

## What it handles well

- **Indian names**: Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar, and similar
- **Indian cities**: Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
- **Indian number system**: `Rs.7,25,000` → *"seven lakh twenty five thousand rupees"*, plus crore expansion
- **Tech acronyms**: DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
- **Natural Indian-English prosody**: the accent and rhythm characteristic of Indian English speakers

---

## Usage

```python
import numpy as np
import soundfile as sf
import torch
from kokoro import KModel, KPipeline

# Load the fine-tuned weights together with the base model's config
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KModel(
    repo_id="hexgrad/Kokoro-82M",
    config="path/to/config.json",  # Kokoro config.json (from hexgrad/Kokoro-82M)
    model="kokoro_converted.pth",
).to(device).eval()

# lang_code="b" selects British English G2P, matching the en-gb setup used in training
pipeline = KPipeline(lang_code="b", repo_id="hexgrad/Kokoro-82M", model=model)
voice = torch.load("voicepack.pt", map_location="cpu", weights_only=True)

# Synthesize
text = "Dr. Narayanan from Chennai will present the quarterly report — total revenue was Rs.3,75,000."
# Call the pipeline directly for raw text; generate_from_tokens expects
# phonemes or tokens that have already been through G2P.
gen = pipeline(text, voice=voice, speed=1.0)
audio = np.concatenate([a for _, _, a in gen])
sf.write("output.wav", audio, 24000)  # 24 kHz, matching the training audio spec
```

**G2P note:** For best results, run text through the same G2P pipeline used during training:

- Language: `en-gb` (British espeak via [misaki](https://github.com/hexgrad/misaki))
- A custom lexicon for Indian names is recommended; use the same lexicon at inference as at training

---

## Training Details

| Detail | Value |
|---|---|
| Base model | hexgrad/Kokoro-82M |
| Architecture | StyleTTS2 / ISTFTNet |
| Training framework | [kokoro-deutsch](https://github.com/semidark/kokoro-deutsch) |
| Dataset | ~1,058 clips, single male Indian-English speaker, ~1.5–2h total |
| Audio spec | 24 kHz mono WAV |
| Stage 1 | Acoustic warmup (decoder, style_encoder, text_aligner, pitch_extractor) |
| Stage 2 | 20 epochs of prosody fine-tuning (predictor, predictor_encoder) |
| Key hyperparams | `lambda_F0: 2.0`, `batch_size: 4`, `max_len: 200`, `lr: 5e-5` |
| G2P | misaki `en-gb` + custom Indian-English lexicon |

---

## Credits

- **[hexgrad/Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M)**: Base TTS model (Apache 2.0)
- **[yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)**: Underlying architecture (MIT)
- **[semidark/kokoro-deutsch](https://github.com/semidark/kokoro-deutsch)**: Training framework (Apache 2.0)
- **[hexgrad/misaki](https://github.com/hexgrad/misaki)**: G2P engine (Apache 2.0)

---

## License

Apache 2.0, the same license as the base Kokoro-82M model.
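
---

## Example: Indian number expansion

The lakh/crore expansion listed under "What it handles well" (`Rs.7,25,000` → "seven lakh twenty five thousand rupees") can also be applied as a text-normalization step before synthesis. The following is a minimal sketch of such a step, not code taken from the recipe: the function names `indian_number_to_words` and `expand_rupees` are hypothetical, and it assumes whole-rupee amounts written with an `Rs.` prefix.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def _below_hundred(n: int) -> str:
    """Spell out 1..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def _below_thousand(n: int) -> str:
    """Spell out 1..999."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if rest:
        parts.append(_below_hundred(rest))
    return " ".join(parts)


def indian_number_to_words(n: int) -> str:
    """Spell out a non-negative integer using crore/lakh/thousand grouping."""
    if n < 0:
        raise ValueError("negative numbers are not handled in this sketch")
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)  # 1 crore = 1,00,00,000
    lakh, n = divmod(n, 100_000)      # 1 lakh  = 1,00,000
    thousand, n = divmod(n, 1_000)
    parts = []
    if crore:
        # Crore counts can exceed 99, so spell them out recursively.
        parts.append(f"{indian_number_to_words(crore)} crore")
    if lakh:
        parts.append(f"{_below_hundred(lakh)} lakh")
    if thousand:
        parts.append(f"{_below_hundred(thousand)} thousand")
    if n:
        parts.append(_below_thousand(n))
    return " ".join(parts)


# Matches "Rs.7,25,000", "Rs 500", "Rs. 1,25,00,000"; a trailing comma or
# full stop after the digits is left in place.
RUPEE_RE = re.compile(r"Rs\.?\s?(\d+(?:,\d+)*)")


def expand_rupees(text: str) -> str:
    """Replace Rs. amounts in text with their spoken Indian-English form."""
    def repl(match: re.Match) -> str:
        value = int(match.group(1).replace(",", ""))
        return f"{indian_number_to_words(value)} rupees"
    return RUPEE_RE.sub(repl, text)


print(expand_rupees("Rs.7,25,000"))
# seven lakh twenty five thousand rupees
```

Running text through a step like this before synthesis makes the spoken form explicit. Amounts with paise, negative values, and the `₹` symbol would need additional rules.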