---
language:
- en
tags:
- text-to-speech
- tts
- xtts
- xtts-v2
- coqui
- indian-english
- fine-tuned
license: other
base_model: coqui/XTTS-v2
---

# XTTS v2: Indian English Fine-Tune


🔗 **Recipe:** [xttsv2-recipe](https://github.com/Jeevav62/tts-finetune-recipes/tree/main/xttsv2-recipe)

A fine-tuned version of [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) adapted for **Indian-accented English** speech synthesis.

XTTS v2 is a 518M-parameter GPT-based multilingual TTS model with zero-shot voice cloning. This fine-tune improves naturalness, prosody, and pronunciation for Indian-English speakers and vocabulary: Indian names, Indian city names, the lakh/crore number system, and tech acronyms.

---

## Checkpoint Info

| Detail | Value |
|---|---|
| Best step | 11,074 |
| Total steps trained | 11,250 |
| Best eval loss | **2.697** (down from 3.766 at start) |
| Eval mel loss | 2.670 (−1.065) |
| Eval text loss | 0.027 (−0.003) |

The checkpoint (`best_model.pth`) is a **full training checkpoint**: it contains both the model weights and the optimizer state, so you can run inference from it and also resume training.

- Model weights only: ~2 GB
- With optimizer state (Adam m + v buffers): **5.3 GB total**

---

## Files

| File | Size | Description |
|---|---|---|
| `best_model.pth` | 5.3 GB | Full training checkpoint (weights + optimizer state) |
| `config.json` | 6 KB | Training config (required for inference and resuming training) |

**Additional files required for inference** (not stored here; auto-downloaded by the TTS library on first use):

| File | Source |
|---|---|
| `vocab.json` | Downloaded from Coqui's servers automatically |
| `dvae.pth` | Downloaded from Coqui's servers automatically |
| `mel_stats.pth` | Downloaded from Coqui's servers automatically |

---

## Inference

### Install

```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "TTS>=0.22.0" "torch>=2.1" "torchaudio>=2.1"
```

### Basic inference (voice cloning)

```python
from TTS.api import TTS

tts = TTS(model_path="best_model.pth", config_path="config.json").to("cuda")

tts.tts_to_file(
    text="The meeting is on 23rd April at 4:30 PM IST. Dr. Narayanan from Chennai will present.",
    speaker_wav="reference.wav",  # 5-10 sec clean WAV of the target speaker at 24 kHz
    language="en",
    file_path="output.wav",
)
```

### Inference with the XTTS model directly

```python
import soundfile as sf
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="best_model.pth",
    vocab_path="vocab.json",  # tokenizer vocab from the base XTTS v2 download; adjust the path
    eval=True,
)
model.cuda()

# Get speaker conditioning from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Synthesize
out = model.inference(
    text="Dr. Narayanan Subramanian from Chennai will meet Aravind Sridhar tomorrow.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
    repetition_penalty=2.5,
    top_k=50,
    top_p=0.85,
)

# XTTS v2 outputs audio at 24 kHz
sf.write("output.wav", out["wav"], 24000)
```

### Reference audio requirements

- Format: WAV, mono
- Sample rate: 24 kHz (22050 Hz also works; the TTS library resamples)
- Duration: **5–10 seconds** (longer does not help)
- Quality: Clean, no background noise, no music
- Same speaker as your target voice

---

## What it handles well

- **Indian names:** Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar
- **Indian cities:** Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
- **Indian number system:** `Rs.7,25,000` → *"seven lakh twenty five thousand rupees"*, crore
- **Tech acronyms:** DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
- **Natural Indian-English prosody:** accent and rhythm of Indian English speech

---

## Resuming Training / Fine-Tuning Further

The checkpoint includes full optimizer state, so training can be resumed exactly where it left off.
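If you only need inference, the optimizer state is dead weight, and it accounts for most of the 5.3 GB file. The following is a minimal sketch of exporting a smaller copy. It assumes the checkpoint is a plain dictionary with the weights under `"model"` and the training-only state under `"optimizer"` and `"scaler"`. Those key names are an assumption about the trainer's layout, so print `checkpoint.keys()` to confirm them first. `strip_training_state` and `export_weights_only` are hypothetical helper names, not part of the TTS library.

```python
from typing import Any, Dict, Iterable

# Assumed names of the training-only entries; verify against checkpoint.keys().
TRAINING_ONLY_KEYS = ("optimizer", "scaler")


def strip_training_state(
    checkpoint: Dict[str, Any],
    drop_keys: Iterable[str] = TRAINING_ONLY_KEYS,
) -> Dict[str, Any]:
    """Return a shallow copy of the checkpoint without the training-only entries."""
    drop = set(drop_keys)
    return {key: value for key, value in checkpoint.items() if key not in drop}


def export_weights_only(src_path: str, dst_path: str) -> None:
    """Write an inference-only copy of a full training checkpoint."""
    import torch  # imported lazily so strip_training_state stays dependency-free

    # weights_only=False unpickles arbitrary objects: only use it on files you trust.
    checkpoint = torch.load(src_path, map_location="cpu", weights_only=False)
    torch.save(strip_training_state(checkpoint), dst_path)
```

Calling `export_weights_only("best_model.pth", "best_model_inference.pth")` would write the reduced file. Keep the original checkpoint if you plan to resume training, because the stripped copy no longer carries the optimizer state.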

### Dataset format

Pipe-delimited CSV, no header, 2 columns:

```
clip_0001|The total cost is Rs.7,25,000 for the Bengaluru office setup.
clip_0002|Dr. Narayanan will present the quarterly results tomorrow at 4 PM IST.
clip_0003|Please forward the report to admin at example dot com by Friday.
```

Audio files: `wavs/clip_0001.wav`, `wavs/clip_0002.wav`, etc.

- Format: mono WAV, **22050 Hz**, 1–24 seconds per clip

### Resume training from this checkpoint

```python
# In your train script, point the base model at this checkpoint:
XTTS_CHECKPOINT = "best_model.pth"  # this file
XTTS_CONFIG = "config.json"         # this file

# Then run the standard XTTS v2 training recipe:
#   TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py
# with formatter="thorsten" for 2-column id|text metadata
```

Key config values used during training (from `config.json`):

| Parameter | Value |
|---|---|
| `lr` | `5e-6` |
| `batch_size` | `2` (per GPU, 2× GPUs) |
| `eval_split_size` | `0.1` (10% held out) |
| `eval_split_max_size` | `80` clips |
| `save_step` | `500` |
| `lr_scheduler` | `MultiStepLR` |
| `lr_scheduler milestones` | `[3500, 7000, 10000]` |
| `lr_scheduler gamma` | `0.5` (halved at each milestone) |
| `gpt_cond_len` | `12` seconds |
| `gpt_cond_chunk_len` | `4` seconds |
| `gpt_max_audio_tokens` | `605` |
| `gpt_max_text_tokens` | `402` |
| `distributed_backend` | `nccl` (multi-GPU DDP) |

### Tips for further fine-tuning

- **Lower the LR** when resuming: the model has already converged, so use `lr: 1e-6` or `2e-6`
- **More data is better:** 1000+ clips is the sweet spot, with diminishing returns above 5000
- **Watch eval mel loss:** it should stay below 2.7; if it climbs, you're overfitting
- **Reference audio matters more than the model:** use a very clean 8-second clip for best voice cloning results
- **OOM during training:** reduce `max_wav_len` to `220500` (10 sec at 22050 Hz) or lower `batch_size` to 1

---

## Training History

| Run | Steps | Best eval loss | Notes |
|---|---|---|---|
| Run 1 (Mar 17) | ~1,650 | 2.735 | Initial run, still converging |
| Run 2 (Mar 18) | ~5,085 | 2.794 | Different config, higher loss |
| **Run 3 (Mar 18)** | **11,074** | **2.697** | **This checkpoint, best overall** |
| Run 4 (Mar 21) | ~6,885 | 2.829 | Interrupted, worse result |

Run 3 achieved the lowest eval loss across all experiments and is the checkpoint published here.

---

## Credits

- **[Coqui TTS / XTTS v2](https://github.com/coqui-ai/TTS):** Base model and training framework (framework code under MPL 2.0)
- **[Thorsten Müller](https://github.com/thorstenMueller):** Dataset format convention (`id|text` two-column pipe-delimited)

---

## License

The fine-tuned weights follow the [Coqui Public Model License (CPML)](https://coqui.ai/cpml) of the base XTTS v2 model.