---
language:
- en
tags:
- text-to-speech
- tts
- xtts
- xtts-v2
- coqui
- indian-english
- fine-tuned
license: other
base_model: coqui/XTTS-v2
---

# XTTS v2 — Indian English Fine-Tune

<p>
  <a href="https://github.com/Jeevav62/tts-finetune-recipes/tree/main/xttsv2-recipe">
    <img src="https://img.shields.io/badge/GitHub-Fine--Tuning%20Recipe-181717?style=for-the-badge&logo=github&logoColor=white" alt="GitHub Recipe"/>
  </a>
  <a href="https://huggingface.co/coqui/XTTS-v2">
    <img src="https://img.shields.io/badge/Base%20Model-XTTS--v2-ff6b00?style=for-the-badge&logo=huggingface&logoColor=white" alt="Base Model"/>
  </a>
  <img src="https://img.shields.io/badge/Language-Indian%20English-blue?style=for-the-badge" alt="Language"/>
</p>

🔗 **Recipe:** [xttsv2-recipe](https://github.com/Jeevav62/tts-finetune-recipes/tree/main/xttsv2-recipe)


A fine-tuned version of [Coqui XTTS v2](https://huggingface.co/coqui/XTTS-v2) adapted for **Indian-accented English** speech synthesis.

XTTS v2 is a 518M-parameter GPT-based multilingual TTS model with zero-shot voice cloning. This fine-tune improves naturalness, prosody, and pronunciation for Indian-English speech, including Indian names, Indian city names, the lakh/crore number system, and tech acronyms.

---

## Checkpoint Info

| Detail | Value |
|---|---|
| Best step | 11,074 |
| Total steps trained | 11,250 |
| Best eval loss | **2.697** (down from 3.766 at start) |
| Eval mel loss | 2.670 (−1.065) |
| Eval text loss | 0.027 (−0.003) |

The checkpoint (`best_model.pth`) is a **full training checkpoint**: it contains both the model weights and the optimizer state, so it can be used for inference and for resuming training.

- Model weights only: ~2 GB
- With optimizer state (Adam m + v buffers): **5.3 GB total**
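
If only inference is needed, the optimizer state can be dropped to get closer to the weights-only size. The following is a minimal sketch, assuming the checkpoint is a plain dict in which training-only state sits under `optimizer` and `scaler` keys; those key names and the output filename `best_model_inference.pth` are assumptions, so inspect `ckpt.keys()` before relying on it.

```python
# Keys assumed to hold training-only state; verify against ckpt.keys().
TRAINING_ONLY_KEYS = ("optimizer", "scaler")


def strip_training_state(ckpt: dict) -> dict:
    """Return a shallow copy of the checkpoint dict without training-only entries."""
    return {k: v for k, v in ckpt.items() if k not in TRAINING_ONLY_KEYS}


def slim_checkpoint(src: str, dst: str) -> None:
    """Load a full checkpoint from src and save a copy without optimizer state to dst."""
    import torch  # imported lazily so strip_training_state stays dependency-free

    # weights_only=False unpickles arbitrary objects: only load files you trust.
    ckpt = torch.load(src, map_location="cpu", weights_only=False)
    torch.save(strip_training_state(ckpt), dst)


# Example (hypothetical output name):
# slim_checkpoint("best_model.pth", "best_model_inference.pth")
```

A file slimmed this way can no longer restore the optimizer state, so keep the original if you plan to resume training.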

---

## Files

| File | Size | Description |
|---|---|---|
| `best_model.pth` | 5.3 GB | Full training checkpoint (weights + optimizer state) |
| `config.json` | 6 KB | Training config (required for inference and resuming training) |

**Additional files required for inference** (not stored here — auto-downloaded by the TTS library on first use):

| File | Source |
|---|---|
| `vocab.json` | Downloaded from Coqui's servers automatically |
| `dvae.pth` | Downloaded from Coqui's servers automatically |
| `mel_stats.pth` | Downloaded from Coqui's servers automatically |

---

## Inference

### Install

```bash
# Quote the version specifiers so the shell does not treat ">" as output redirection
pip install "TTS>=0.22.0" "torch>=2.1" "torchaudio>=2.1"
```

### Basic inference (voice cloning)

```python
from TTS.api import TTS

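# For XTTS, TTS.api treats model_path as a directory and looks for model.pth and
# vocab.json inside it. "xtts_indian_english/" is a placeholder folder holding
# best_model.pth (renamed to model.pth), config.json and vocab.json.
tts = TTS(
    model_path="xtts_indian_english/",
    config_path="xtts_indian_english/config.json",
).to("cuda")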

tts.tts_to_file(
    text="The meeting is on 23rd April at 4:30 PM IST. Dr. Narayanan from Chennai will present.",
    speaker_wav="reference.wav",   # 5-10 sec clean WAV of the target speaker at 24 kHz
    language="en",
    file_path="output.wav"
)
```

### Inference with the XTTS model directly

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load config and model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
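# vocab.json is the base XTTS v2 tokenizer file; pass the path to wherever it was
# downloaded. Without vocab_path or a checkpoint_dir the tokenizer cannot be located.
model.load_checkpoint(
    config, checkpoint_path="best_model.pth", vocab_path="vocab.json", eval=True
)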
model.cuda()

# Get speaker conditioning from reference audio
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

# Synthesize
out = model.inference(
    text="Dr. Narayanan Subramanian from Chennai will meet Aravind Sridhar tomorrow.",
    language="en",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.7,
    repetition_penalty=2.5,
    top_k=50,
    top_p=0.85,
)

import soundfile as sf
sf.write("output.wav", out["wav"], 24000)
```

### Reference audio requirements

- Format: WAV, mono
- Sample rate: 24 kHz (22050 Hz also works — TTS library resamples)
- Duration: **5–10 seconds** (longer does not help)
- Quality: Clean, no background noise, no music
- Same speaker as your target voice
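
The format, sample-rate, and duration requirements above can be checked before synthesis. This is a small sketch using only the standard-library `wave` module, which reads PCM WAV files only; the function name `check_reference_wav` is hypothetical, and noise or speaker identity cannot be verified this way.

```python
import wave


def check_reference_wav(path: str, min_sec: float = 5.0, max_sec: float = 10.0) -> list[str]:
    """Return a list of problems found in a PCM WAV file; an empty list means it looks fine."""
    problems = []
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if rate not in (22050, 24000):
        problems.append(f"sample rate is {rate} Hz; 24000 or 22050 Hz recommended")
    if not min_sec <= duration <= max_sec:
        problems.append(f"duration is {duration:.1f} s; expected {min_sec:g}-{max_sec:g} s")
    return problems
```

Calling `check_reference_wav("reference.wav")` before `get_conditioning_latents` gives an early warning about an unsuitable clip.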

---

## What it handles well

- **Indian names** — Narayanan, Subramanian, Aravind, Sridhar, Venkatesh, Rajkumar
- **Indian cities** — Bengaluru, Chennai, Hyderabad, Coimbatore, Tiruchirappalli, Madurai
- **Indian number system** — `Rs.7,25,000` → *"seven lakh twenty five thousand rupees"*, crore
- **Tech acronyms** — DDoS, GPU, IST, MMS, CEO, VRAM, AI/ML, CI/CD
- **Natural Indian-English prosody** — accent and rhythm of Indian English speech
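
To make the lakh/crore reading concrete, here is an illustrative sketch of how an amount with Indian digit grouping maps to words. It is not the model's text normalizer; the function names are hypothetical and it only covers non-negative whole-rupee amounts.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Words for 0-99 (empty string for 0)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return (TENS[tens] + " " + ONES[ones]).strip()


def three_digits(n: int) -> str:
    """Words for 0-999 (empty string for 0)."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)


def number_to_indian_words(n: int) -> str:
    """Spell out a non-negative integer using crore (10^7) and lakh (10^5) groups."""
    if n == 0:
        return "zero"
    crore, n = divmod(n, 10_000_000)
    lakh, n = divmod(n, 100_000)
    thousand, n = divmod(n, 1_000)
    parts = []
    if crore:
        parts.append(number_to_indian_words(crore) + " crore")
    if lakh:
        parts.append(two_digits(lakh) + " lakh")
    if thousand:
        parts.append(two_digits(thousand) + " thousand")
    if n:
        parts.append(three_digits(n))
    return " ".join(parts)


def rupees_to_words(amount: str) -> str:
    """Convert a string like 'Rs.7,25,000' to its spoken form."""
    digits = amount.replace("Rs.", "").replace(",", "").strip()
    return number_to_indian_words(int(digits)) + " rupees"
```

With this sketch, `rupees_to_words("Rs.7,25,000")` yields the reading quoted in the list above, and `number_to_indian_words(12500000)` gives "one crore twenty five lakh".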

---

## Resuming Training / Fine-Tuning Further

The checkpoint includes full optimizer state, so training can be resumed exactly where it left off.

### Dataset format

Pipe-delimited CSV, no header, 2 columns:

```
clip_0001|The total cost is Rs.7,25,000 for the Bengaluru office setup.
clip_0002|Dr. Narayanan will present the quarterly results tomorrow at 4 PM IST.
clip_0003|Please forward the report to admin at example dot com by Friday.
```

Audio files: `wavs/clip_0001.wav`, `wavs/clip_0002.wav`, etc.
- Format: mono WAV, **22050 Hz**, 1–24 seconds per clip
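
Before training, it can help to confirm that every metadata row has exactly two columns and a matching file under `wavs/`. A minimal sketch follows; the metadata filename `metadata.csv` and the function name `load_metadata` are assumptions, not part of the recipe.

```python
from pathlib import Path


def load_metadata(dataset_dir: str, metadata_name: str = "metadata.csv") -> list[tuple[Path, str]]:
    """Parse id|text lines and return (wav_path, text) pairs, raising on malformed rows."""
    root = Path(dataset_dir)
    pairs = []
    with open(root / metadata_name, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # tolerate blank lines
            cols = line.split("|")
            if len(cols) != 2 or not cols[0] or not cols[1].strip():
                raise ValueError(f"line {line_no}: expected 'id|text', got {line!r}")
            wav_path = root / "wavs" / f"{cols[0]}.wav"
            if not wav_path.is_file():
                raise FileNotFoundError(f"line {line_no}: missing {wav_path}")
            pairs.append((wav_path, cols[1].strip()))
    return pairs
```

Rows whose text contains an extra `|` are rejected here on purpose, since a two-column parser would otherwise silently truncate the transcript.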

### Resume training from this checkpoint

```python
# In your train script, point the base model at this checkpoint:
XTTS_CHECKPOINT = "best_model.pth"     # this file
XTTS_CONFIG     = "config.json"        # this file

# Then run the standard XTTS v2 training recipe:
# TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py
# with formatter="thorsten" for 2-column id|text metadata
```

Key config values used during training (from `config.json`):

| Parameter | Value |
|---|---|
| `lr` | `5e-6` |
| `batch_size` | `2` (per GPU, 2× GPUs) |
| `eval_split_size` | `0.1` (10% held out) |
| `eval_split_max_size` | `80` clips |
| `save_step` | `500` |
| `lr_scheduler` | `MultiStepLR` |
| `lr_scheduler milestones` | `[3500, 7000, 10000]` |
| `lr_scheduler gamma` | `0.5` (halved at each milestone) |
| `gpt_cond_len` | `12` seconds |
| `gpt_cond_chunk_len` | `4` seconds |
| `gpt_max_audio_tokens` | `605` |
| `gpt_max_text_tokens` | `402` |
| `distributed_backend` | `nccl` (multi-GPU DDP) |
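
The learning-rate values in the table imply a simple step-wise decay. As a rough sketch, assuming the milestones are counted in scheduler steps and follow PyTorch's `MultiStepLR` rule (the rate is multiplied by `gamma` once for every milestone reached), the rate at any step can be computed like this; `lr_at_step` is a hypothetical helper.

```python
from bisect import bisect_right

BASE_LR = 5e-6
MILESTONES = [3500, 7000, 10000]
GAMMA = 0.5


def lr_at_step(step: int) -> float:
    """Learning rate after `step` scheduler steps under a MultiStepLR-style decay."""
    # bisect_right counts the milestones that are <= step.
    return BASE_LR * GAMMA ** bisect_right(MILESTONES, step)
```

Under these assumptions the rate is 5e-6 until step 3,499, 2.5e-6 from step 3,500, 1.25e-6 from step 7,000, and 6.25e-7 from step 10,000 onward, which covers the best step (11,074).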

### Tips for further fine-tuning

- **Lower the LR** when resuming — the model is already converged, use `lr: 1e-6` or `2e-6`
- **More data is better** — 1000+ clips is the sweet spot; diminishing returns above 5000
- **Watch eval mel loss** — should stay below 2.7; if it climbs, you're overfitting
- **Reference audio matters more than model** — use a very clean 8-second clip for best voice cloning results
- **OOM during training** — reduce `max_wav_len` to `220500` (10 sec) or lower `batch_size` to 1
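
The `max_wav_len` value in the last tip is a sample count: at 22050 Hz, 10 seconds is 220,500 samples. The sketch below converts a duration to samples and lists clips that exceed the limit; the helper names and the clip durations are hypothetical, and how the trainer treats over-length clips should be checked in your TTS version.

```python
SAMPLE_RATE = 22050  # training clips are 22050 Hz mono


def max_wav_len_for(seconds: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a maximum clip duration in seconds to a length in samples."""
    return int(round(seconds * sample_rate))


def clips_over_limit(durations_sec: dict[str, float], max_wav_len: int,
                     sample_rate: int = SAMPLE_RATE) -> list[str]:
    """Return the ids of clips whose length in samples exceeds max_wav_len."""
    return [clip_id for clip_id, sec in durations_sec.items()
            if sec * sample_rate > max_wav_len]
```

For example, `max_wav_len_for(10)` returns `220500`, and a 12.5-second clip would be reported by `clips_over_limit` at that setting.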

---

## Training History

| Run | Steps | Best eval loss | Notes |
|---|---|---|---|
| Run 1 (Mar 17) | ~1,650 | 2.735 | Initial run, still converging |
| Run 2 (Mar 18) | ~5,085 | 2.794 | Different config, higher loss |
| **Run 3 (Mar 18)** | **11,250** (best at 11,074) | **2.697** | **This checkpoint, best overall** |
| Run 4 (Mar 21) | ~6,885 | 2.829 | Interrupted, worse result |

Run 3 achieved the lowest eval loss across all experiments and is the checkpoint published here.

---

## Credits

- **[Coqui TTS / XTTS v2](https://github.com/coqui-ai/TTS)**: Base model and training framework (library code under MPL 2.0; XTTS v2 weights under CPML)
- **[Thorsten Müller](https://github.com/thorstenMueller)**: Dataset format convention (`id|text` two-column pipe-delimited)

---

## License

The fine-tuned weights follow the [Coqui Public Model License (CPML)](https://coqui.ai/cpml) of the base XTTS v2 model.