# Sinhala TTS VITS — Developer Guide 🗣️🇱🇰

Comprehensive guide for training, fine-tuning, and deploying the Sinhala Text-to-Speech VITS model.

---

## Table of Contents

1. [Setup](#1-setup)
2. [Project Structure](#2-project-structure)
3. [Training from Scratch](#3-training-from-scratch)
4. [Fine-Tuning on Custom Data](#4-fine-tuning-on-custom-data)
5. [Dataset Preparation](#5-dataset-preparation)
6. [Cloud Platform Setup](#6-cloud-platform-setup)
7. [Export to SafeTensors](#7-export-to-safetensors)
8. [Deployment Options](#8-deployment-options)
9. [Troubleshooting](#9-troubleshooting)

---

## 1. Setup

### Local Environment

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# .\venv\Scripts\activate  # Windows

# Install Coqui TTS
pip install coqui-tts

# Install extras for development
pip install numpy soundfile safetensors fastapi uvicorn pydantic

# Verify
python -c "from TTS.tts.models.vits import Vits; print('VITS loaded')"
```

### Hardware Requirements

| Task | Min RAM | Min GPU VRAM (example GPU) | Storage | Approx. time |
|------|---------|----------|---------|------|
| Inference | 4 GB | 2 GB (or CPU) | 500 MB | Real-time |
| Fine-tuning | 8 GB | 4 GB (T4/P100) | 2 GB | 1-2 hours |
| Training from scratch | 16 GB | 8 GB (A100) | 10 GB | 8-24 hours |

### Sinhala Character Vocabulary

The model uses a **93-character vocabulary**: 74 Sinhala Unicode characters + 19 punctuation marks.

```python
sinhala_chars = "ංඃඅආඇඈඉඊඋඌඍඑඒඓඔඕඖකඛගඝඞඟචඡජඣඤඥටඨඩඪඬතථදධනඳපඵබභමඹයරලවශෂසහළෆ්ාැෑිීුූෘෙේෛොෝෞෟෲ"
punctuations = " !'(),-.:;=?[]\u200d\u2018\u2019\u201c\u201d"
```
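
Characters outside this vocabulary have no token ID, so it can help to scan input text before synthesis. The helper below is a minimal sketch: `find_oov` is a hypothetical name, not part of Coqui TTS, and it assumes the vocabulary is exactly the two strings defined above.

```python
import unicodedata

# Vocabulary copied from the definition above
SINHALA_CHARS = "ංඃඅආඇඈඉඊඋඌඍඑඒඓඔඕඖකඛගඝඞඟචඡජඣඤඥටඨඩඪඬතථදධනඳපඵබභමඹයරලවශෂසහළෆ්ාැෑිීුූෘෙේෛොෝෞෟෲ"
PUNCTUATIONS = " !'(),-.:;=?[]\u200d\u2018\u2019\u201c\u201d"


def find_oov(text: str, chars: str = SINHALA_CHARS, punctuations: str = PUNCTUATIONS) -> list[str]:
    """Return the distinct out-of-vocabulary characters of `text`, in order of first appearance."""
    vocab = set(chars) | set(punctuations)
    seen = []
    for ch in text:
        if ch not in vocab and ch not in seen:
            seen.append(ch)
    return seen


if __name__ == "__main__":
    # Latin letters and digits are not in the vocabulary, so they are reported
    for ch in find_oov("TTS 2024!"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```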

---

## 2. Project Structure

```
├── config.json                    # Model configuration
├── speakers.json                  # Speaker ID mapping
├── sinhala_tts_vits_model.safetensors  # Model weights
├── app.py                         # Gradio demo (HF Space)
├── server.py                      # FastAPI REST server
├── Dockerfile                     # Production Docker build
├── DEVELOPER_GUIDE.md             # This guide
├── README.md                      # Model card
├── retrain_vits.py                # Training script
├── training_vits/
│   ├── output/                    # Training output + checkpoints
│   └── sinhala_data/              # Training audio + metadata
└── outputs/
    └── export_safetensors.py      # SafeTensors conversion script
```

---

## 3. Training from Scratch

### Step 1: Prepare Dataset

Organize the audio into one folder per speaker and create a `metadata.csv` with the columns `audio_file,text,speaker`:

```
data/
├── speaker_1/
│   ├── utterance_001.wav
│   ├── utterance_002.wav
│   └── ...
├── speaker_2/
│   └── ...
└── metadata.csv
```

### Step 2: Configure Training

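Edit the key parameters in `config.json`. The excerpt below is abridged: `...` stands for the remainder of the vocabulary from section 1. JSON does not allow comments, so keep the real file comment-free. `num_chars` (97) is presumably the 93-character vocabulary plus the tokenizer's four special tokens (pad, EOS, BOS, blank).

```json
{
  "model_args": {
    "num_chars": 97,
    "num_speakers": 16,
    "speaker_embedding_channels": 256
  },
  "audio": {
    "sample_rate": 16000,
    "fft_size": 1024,
    "hop_length": 256,
    "num_mels": 80
  },
  "characters": {
    "characters_class": "TTS.tts.utils.text.characters.Graphemes",
    "characters": "ංඃඅ...",
    "punctuations": " !'(),-.:;=?[]\u200d..."
  },
  "text_cleaner": "sinhala_cleaners",
  "epochs": 100,
  "batch_size": 8,
  "lr": 0.0002
}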
```
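
To make the `audio` values concrete, the arithmetic below derives the frame timing they imply. It is a plain calculation from the numbers above, not code taken from `retrain_vits.py`. The frame count is approximate because the exact value depends on how the STFT pads the signal.

```python
sample_rate = 16000  # Hz
fft_size = 1024      # samples per analysis window
hop_length = 256     # samples between consecutive frames

frame_shift_ms = hop_length * 1000 / sample_rate  # 16.0 ms between frames
window_ms = fft_size * 1000 / sample_rate         # 64.0 ms analysis window
frames_per_second = sample_rate / hop_length      # 62.5 frames per second


def num_frames(duration_s: float) -> int:
    """Approximate number of spectrogram frames for a clip of the given length."""
    return int(duration_s * sample_rate) // hop_length


print(frame_shift_ms, window_ms, frames_per_second, num_frames(10.0))
```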

### Step 3: Run Training

```bash
# Run the project's training script (adapt the paths inside it as needed)
python retrain_vits.py
```

Or use the Coqui TTS training entry point directly. The dataset and output paths are read from `config.json`:

```bash
python -m TTS.bin.train_tts --config_path config.json

# To resume from or fine-tune an existing checkpoint (not needed
# when training from scratch), add:
#   --restore_path ./checkpoint.pth
```

### Step 4: Monitor Training

```bash
# View training logs
tensorboard --logdir ./training_output

# Check generated samples
ls ./training_output/*.wav
```

---

## 4. Fine-Tuning on Custom Data

### Adding a New Speaker

1. **Prepare audio data**: 5-10 minutes of clean speech, split into 3-10 second clips
2. **Create speaker entry**: Add the new speaker to `speakers.json`
3. **Add metadata**: Add the new clips to the training metadata CSV
4. **Fine-tune with fewer epochs**:

```python
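from safetensors.torch import load_file
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits

# Build the model from the released configuration
config = VitsConfig()
config.load_json("config.json")
model = Vits.init_from_config(config)

# Load pre-trained weights. strict=False tolerates missing or extra keys,
# so print the report and confirm any mismatch is expected.
result = model.load_state_dict(load_file("sinhala_tts_vits_model.safetensors"), strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)

# Fine-tune for 20-50 epochs with a lower learning rate
config.epochs = 50
# VitsConfig keeps separate generator and discriminator learning rates
config.lr_gen = 0.00005
config.lr_disc = 0.00005
# ... run training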
```
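
Step 2 of the list above can be scripted. The sketch below assumes `speakers.json` is a flat JSON object that maps speaker names to integer IDs. That layout is an assumption, so inspect the actual file first. `add_speaker` is a hypothetical helper.

```python
import json
from pathlib import Path


def add_speaker(speakers_path: str, name: str) -> int:
    """Add `name` to a name -> ID mapping and return its ID (existing IDs are kept)."""
    path = Path(speakers_path)
    speakers = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if name not in speakers:
        # Use the next free integer so existing speakers keep their indices
        speakers[name] = max(speakers.values(), default=-1) + 1
        path.write_text(json.dumps(speakers, ensure_ascii=False, indent=2), encoding="utf-8")
    return speakers[name]
```

If the returned ID is not below `model_args.num_speakers`, the speaker embedding table would also need to grow before fine-tuning.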

### Voice Adaptation Tips

- Record high-quality audio (16 kHz minimum, 24 kHz preferred); the preprocessing pipeline in section 5 resamples it to the model's 16 kHz
- Ensure consistent recording environment (minimal background noise)
- Trim silence from beginning and end of clips
- Normalize volume levels across all samples
- Minimum 50 utterances per new speaker for best results

---

## 5. Dataset Preparation

### Audio Requirements

| Parameter | Value |
|-----------|-------|
| Sample Rate | 16 kHz (minimum) |
| Format | WAV (PCM 16-bit) |
| Duration | 1-15 seconds per clip |
| SNR | > 20 dB |
| Language | Sinhala (සිංහල) |

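The sample rate, bit depth, and duration rows can be checked automatically before training. The following is a minimal sketch using only the standard library `wave` module. `check_clip` is a hypothetical helper, and the SNR and language rows are not checked.

```python
import wave


def check_clip(path: str, min_sr: int = 16000, min_s: float = 1.0, max_s: float = 15.0) -> list[str]:
    """Return a list of problems found in a PCM WAV file (an empty list means it passes)."""
    problems = []
    # wave.open raises wave.Error for files that are not PCM-encoded WAV
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        duration = wf.getnframes() / sample_rate
        if sample_rate < min_sr:
            problems.append(f"sample rate {sample_rate} Hz is below {min_sr} Hz")
        if wf.getsampwidth() != 2:
            problems.append(f"sample width is {8 * wf.getsampwidth()}-bit, expected 16-bit")
        if not min_s <= duration <= max_s:
            problems.append(f"duration {duration:.2f} s is outside {min_s}-{max_s} s")
    return problems
```
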
### Preprocessing Pipeline

```python
import csv
import os

import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 16000  # Matches audio.sample_rate in config.json


def preprocess_audio(input_path: str, output_path: str):
    """Resample to 16 kHz mono, trim silence, and peak-normalize."""
    audio, _ = librosa.load(input_path, sr=TARGET_SR, mono=True)

    # Trim leading/trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=25)

    # Normalize peak amplitude, leaving a little headroom
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak * 0.95

    # Save as 16-bit PCM WAV
    sf.write(output_path, audio, TARGET_SR, subtype="PCM_16")


def prepare_metadata(audio_dir: str, output_csv: str):
    """Create training metadata from per-speaker folders of .wav/.txt pairs."""
    # UTF-8 is set explicitly so Sinhala text survives on every platform
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_file", "text", "speaker"])
        for speaker_dir in sorted(os.listdir(audio_dir)):
            speaker_path = os.path.join(audio_dir, speaker_dir)
            if not os.path.isdir(speaker_path):
                continue
            for wav in sorted(os.listdir(speaker_path)):
                if not wav.endswith(".wav"):
                    continue
                txt_path = os.path.join(speaker_path, os.path.splitext(wav)[0] + ".txt")
                if not os.path.exists(txt_path):
                    continue  # Skip clips without a transcript
                with open(txt_path, encoding="utf-8") as tf:
                    text = tf.read().strip()
                writer.writerow([os.path.join(speaker_dir, wav), text, speaker_dir])
```
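
Before starting a long run, it is worth validating the generated CSV. The sketch below assumes the `audio_file,text,speaker` layout written by `prepare_metadata`. `validate_metadata` is a hypothetical helper.

```python
import csv
import os
from collections import Counter


def validate_metadata(csv_path: str, audio_root: str) -> tuple[Counter, list[str]]:
    """Count usable utterances per speaker and collect rows with missing audio or empty text."""
    per_speaker = Counter()
    errors = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row_no, row in enumerate(csv.DictReader(f), start=1):
            if not os.path.isfile(os.path.join(audio_root, row["audio_file"])):
                errors.append(f"row {row_no}: missing audio {row['audio_file']}")
            elif not row["text"].strip():
                errors.append(f"row {row_no}: empty transcript")
            else:
                per_speaker[row["speaker"]] += 1
    return per_speaker, errors
```

The per-speaker counts can then be compared against the 50-utterance guideline from section 4.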

---

## 6. Cloud Platform Setup

### Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github)

```python
# Colab setup
!pip install coqui-tts torch soundfile safetensors

# Clone model files
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deathlegionteam/sinhala-tts-vits",
                  local_dir="./sinhala_tts")

# Load and synthesize
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
# ... (same as the usage section of the model card, README.md)
```

**T4 GPU**: ~2 hours for fine-tuning (50 epochs, batch_size=4)

### Kaggle

```python
# Kaggle notebook setup
!pip install coqui-tts -q

# Download model
!wget https://huggingface.co/deathlegionteam/sinhala-tts-vits/resolve/main/config.json
!wget https://huggingface.co/deathlegionteam/sinhala-tts-vits/resolve/main/sinhala_tts_vits_model.safetensors
!wget https://huggingface.co/deathlegionteam/sinhala-tts-vits/resolve/main/speakers.json
```

**P100 GPU**: ~30 hrs/week free quota, good for full training runs

### Modal

```python
# modal_train.py
import modal

app = modal.App("sinhala-tts-training")

image = modal.Image.debian_slim().pip_install(
    "coqui-tts", "torch", "soundfile", "safetensors"
)

@app.function(gpu="A100-80GB", image=image, timeout=86400)  # timeout in seconds (24 hours)
def train():
    # Training code here
    pass
```

**A100 80GB**: $20 free credit, ideal for full training from scratch

### RunPod

```bash
# Use template: PyTorch 2.0+ with CUDA 12.x
# Mount the project (training script and data) into the container
docker run -it --gpus all \
    -v "$(pwd)":/workspace \
    -w /workspace \
    runpod/pytorch:latest \
    bash -c "pip install coqui-tts && python retrain_vits.py"
```

**RTX 4090**: ~$0.34/hr, good price/performance for fine-tuning

---

## 7. Export to SafeTensors

```python
"""export_safetensors.py: export a PyTorch checkpoint to SafeTensors."""
from pathlib import Path

import torch
from safetensors.torch import save_file


def export_safetensors(checkpoint_path: str, output_path: str):
    """Convert a .pth checkpoint to .safetensors format."""
    # weights_only=False unpickles arbitrary objects; only load checkpoints you trust
    ckpt = torch.load(checkpoint_path, map_location="cpu", weights_only=False)

    # Extract the model state dict (handle trainer wrappers).
    # Coqui Trainer checkpoints store the weights under "model".
    for wrapper_key in ("model_state_dict", "model", "module"):
        if wrapper_key in ckpt:
            state_dict = ckpt[wrapper_key]
            break
    else:
        state_dict = ckpt

    # Remove the "module." prefix if present (from DataParallel/DDP)
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace("module.", "", 1) if key.startswith("module.") else key
        cleaned[new_key] = value.contiguous()

    save_file(cleaned, output_path)
    print(f"Exported {len(cleaned)} tensors to {output_path}")
    print(f"File size: {Path(output_path).stat().st_size / 1e6:.1f} MB")


if __name__ == "__main__":
    export_safetensors("vits_sinhala_epoch10.pth", "sinhala_tts_vits_model.safetensors")
```
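
To sanity-check an exported file without loading the weights, the header can be read directly. A `.safetensors` file begins with an 8-byte little-endian length, followed by a JSON header that describes each tensor. The sketch below relies only on that layout. `read_safetensors_header` and `count_parameters` are hypothetical helpers, not part of the `safetensors` package.

```python
import json
import math
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the tensor entries of a .safetensors header (name -> dtype, shape, offsets)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # Optional free-form metadata, not a tensor
    return header


def count_parameters(header: dict) -> int:
    """Total number of elements across all tensors listed in the header."""
    return sum(math.prod(info["shape"]) for info in header.values())
```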

---

## 8. Deployment Options

### Option A: HuggingFace Space (Gradio UI)

```bash
# Already deployed at:
# https://huggingface.co/spaces/deathlegionteam/sinhala-tts-demo

# To update:
python -c "
from huggingface_hub import HfApi
api = HfApi(token='YOUR_HF_TOKEN')
api.upload_file(
    path_or_fileobj='app.py',
    path_in_repo='app.py',
    repo_id='deathlegionteam/sinhala-tts-demo',
    repo_type='space'
)
"
```

### Option B: FastAPI + Docker

```bash
# Build and run locally
docker build -t sinhala-tts-server .
docker run -p 8081:8081 sinhala-tts-server

# Test API
curl -X POST http://localhost:8081/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "ආයුබෝවන්!", "speaker": "mettananda"}' \
  --output output.wav
```
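
The same request can be sent from Python. This sketch uses only the standard library. It assumes, as the `curl` call does, that `POST /tts` accepts a JSON body with `text` and `speaker` and returns WAV bytes. `build_tts_request` and `synthesize_to_file` are hypothetical names.

```python
import json
import urllib.request


def build_tts_request(text: str, speaker: str, base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Build the POST /tts request with a UTF-8 encoded JSON body."""
    body = json.dumps({"text": text, "speaker": speaker}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, speaker: str, output_path: str = "output.wav") -> None:
    """Send the request and write the response body to disk."""
    with urllib.request.urlopen(build_tts_request(text, speaker), timeout=60) as resp:
        data = resp.read()
    with open(output_path, "wb") as f:
        f.write(data)
```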

### Option C: Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sinhala-tts
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sinhala-tts
  template:
    metadata:
      labels:
        app: sinhala-tts
    spec:
      containers:
      - name: server
        image: sinhala-tts-server:latest
        ports:
        - containerPort: 8081
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
---
apiVersion: v1
kind: Service
metadata:
  name: sinhala-tts-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8081
  selector:
    app: sinhala-tts
```

---

## 9. Troubleshooting

### Common Issues

| Problem | Cause | Solution |
|---------|-------|----------|
| `Infinity is not valid JSON` | `max_audio_len: Infinity` in config.json | Replace `Infinity` with `1e9` |
| `Configuration Parsing Warning` | Missing `library_name: coqui-tts` in YAML | Add to model card header |
| Model produces silence | Vocab mismatch (default Coqui uses 78 chars) | Set `characters` with full Sinhala vocab |
| `isin_mps_friendly` error | `transformers>=5.0` removed this attribute | Pin `transformers<5` |
| Audio too slow/fast | `length_scale` parameter | Adjust `model_args.length_scale` (1.0=normal) |
| Out of memory | Batch size too large | Reduce `batch_size` or use gradient accumulation |
| Speaker not found | Speaker name not in `speakers.json` | Verify speaker name matches exactly |
| CUDA out of memory | GPU VRAM insufficient | Use CPU: `model.to('cpu')` or reduce model size |

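The first fix in the table can be automated. Python's `json` module accepts the non-standard `Infinity` token when reading, so the sketch below loads the config, replaces infinite values with `1e9`, and writes strict JSON back. `sanitize_infinity` and `fix_config` are hypothetical helpers.

```python
import json
import math


def sanitize_infinity(value, replacement: float = 1e9):
    """Recursively replace +/-Infinity floats with finite values of the same sign."""
    if isinstance(value, float) and math.isinf(value):
        return math.copysign(replacement, value)
    if isinstance(value, dict):
        return {k: sanitize_infinity(v, replacement) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_infinity(v, replacement) for v in value]
    return value


def fix_config(path: str = "config.json") -> None:
    """Rewrite a config file in place so that it is strict JSON."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)  # Tolerates Infinity and -Infinity
    # Serialize first: allow_nan=False raises if a non-finite value (e.g. NaN)
    # is left, and the file is only overwritten once serialization succeeds
    text = json.dumps(sanitize_infinity(config), ensure_ascii=False, indent=2, allow_nan=False)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```
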
### Debugging Tips

```python
import numpy as np
from TTS.tts.utils.text.tokenizer import TTSTokenizer

# Assumes `model`, `config`, `text`, and `speaker` are already defined

# Check model output shape
outputs = model.synthesize(text, config=config, speaker=speaker)
print(outputs.keys())     # Available outputs
print(outputs['wav'].shape)  # Audio waveform shape
print(outputs['wav'].min(), outputs['wav'].max())  # Value range

# Verify tokenizer encoding
tokenizer, config = TTSTokenizer.init_from_config(config)
ids = tokenizer.encode(text)
print(f"Text IDs: {ids}")
print(f"Vocab size: {tokenizer.characters.num_chars}")

# Check if audio is all zeros
wav = outputs['wav']
if np.abs(wav).max() < 1e-6:
    print("WARNING: Audio is silent (all zeros)")
```

### Getting Help

- Open an issue on [HuggingFace Model Repo](https://huggingface.co/deathlegionteam/sinhala-tts-vits)
- Check [Coqui TTS Docs](https://github.com/coqui-ai/TTS)
- Join the [Coqui Discord](https://discord.gg/coqui-ai)

---

## License

Apache 2.0 — See [LICENSE](./LICENSE) for details.

## Maintainer

**Death Legion Team** — [🤗 HuggingFace](https://huggingface.co/deathlegionteam)