# Sinhala TTS VITS: Developer Guide 🗣️🇱🇰

Comprehensive guide for training, fine-tuning, and deploying the Sinhala Text-to-Speech VITS model.

---

## Table of Contents

1. [Setup](#1-setup)
2. [Project Structure](#2-project-structure)
3. [Training from Scratch](#3-training-from-scratch)
4. [Fine-Tuning on Custom Data](#4-fine-tuning-on-custom-data)
5. [Dataset Preparation](#5-dataset-preparation)
6. [Cloud Platform Setup](#6-cloud-platform-setup)
7. [Export to SafeTensors](#7-export-to-safetensors)
8. [Deployment Options](#8-deployment-options)
9. [Troubleshooting](#9-troubleshooting)

---

## 1. Setup

### Local Environment

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# .\venv\Scripts\activate  # Windows

# Install Coqui TTS
pip install coqui-tts

# Install extras for development
pip install numpy soundfile safetensors fastapi uvicorn pydantic

# Verify
python -c "from TTS.tts.models.vits import Vits; print('VITS loaded')"
```

### Hardware Requirements

| Task | Min RAM | GPU VRAM | Storage | Time |
|------|---------|----------|---------|------|
| Inference | 4 GB | 2 GB (or CPU) | 500 MB | Real-time |
| Fine-tuning | 8 GB | 4 GB (e.g. T4/P100) | 2 GB | 1-2 hours |
| Training from scratch | 16 GB | 8 GB (e.g. A100) | 10 GB | 8-24 hours |

### Sinhala Character Vocabulary

The model uses a **93-character vocabulary**: 74 Sinhala Unicode characters + 19 punctuation marks.

```python
sinhala_chars = "ංඃඅආඇඈඉඊඋඌඍඑඒඓඔඕඖකඛගඝඞඟචඡජඣඤඥටඨඩඪඬතථදධනඳපඵබභමඹයරලවශෂසහළෆ්ාැෑිීුූෘෙේෛොෝෞෟෲ"
punctuations = " !'(),-.:;=?[]\u200d\u2018\u2019\u201c\u201d"
```

---

## 2. Project Structure

```
├── config.json                          # Model configuration
├── speakers.json                        # Speaker ID mapping
├── sinhala_tts_vits_model.safetensors   # Model weights
├── app.py                               # Gradio demo (HF Space)
├── server.py                            # FastAPI REST server
├── Dockerfile                           # Production Docker build
├── DEVELOPER_GUIDE.md                   # This guide
├── README.md                            # Model card
├── retrain_vits.py                      # Training script
├── training_vits/
│   ├── output/                          # Training output + checkpoints
│   └── sinhala_data/                    # Training audio + metadata
└── outputs/
    └── export_safetensors.py            # SafeTensors conversion script
```

---

## 3. Training from Scratch

### Step 1: Prepare Dataset

Create a metadata CSV with columns `audio_file,text,speaker`, and keep the audio in one directory per speaker:

```
data/
├── speaker_1/
│   ├── utterance_001.wav
│   ├── utterance_002.wav
│   └── ...
├── speaker_2/
│   └── ...
└── metadata.csv
```

### Step 2: Configure Training

Edit the key parameters in `config.json`. The `characters` and `punctuations` strings are abbreviated here; use the full Sinhala vocabulary from section 1.

```json
{
  "model_args": {
    "num_chars": 97,
    "num_speakers": 16,
    "speaker_embedding_channels": 256
  },
  "audio": {
    "sample_rate": 16000,
    "fft_size": 1024,
    "hop_length": 256,
    "num_mels": 80
  },
  "characters": {
    "characters_class": "TTS.tts.utils.text.characters.Graphemes",
    "characters": "ංඃඅ...",
    "punctuations": " !'(),-.:;=?[]\u200d..."
  },
  "text_cleaner": "sinhala_cleaners",
  "epochs": 100,
  "batch_size": 8,
  "lr": 0.0002
}
```

### Step 3: Run Training

```bash
# Uses retrain_vits.py; adapt the paths inside it as needed
python retrain_vits.py
```

Or call the Coqui TTS training entry point directly. Flag names can differ between versions, so confirm them with `--help` first:

```bash
# --restore_path is only needed when starting from an existing checkpoint
python -m TTS.bin.train_tts \
  --config_path config.json \
  --coqpit.output_path ./training_output \
  --coqpit.datasets.0.path ./data \
  --restore_path ./checkpoint.pth
```

### Step 4: Monitor Training

```bash
# View training logs
tensorboard --logdir ./training_output

# Check generated samples
ls ./training_output/*.wav
```

---

## 4. Fine-Tuning on Custom Data

### Adding a New Speaker

1. **Prepare audio data**: 5-10 minutes of clean speech, split into 3-10 second clips
2. **Create speaker entry**: Add to `speakers.json`
3. **Add metadata**: Include in training manifest
4. **Fine-tune with fewer epochs**:

```python
# Load pre-trained model
from safetensors.torch import load_file
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits

config = VitsConfig()
config.load_json("config.json")
model = Vits.init_from_config(config)

# Load pre-trained weights (strict=False tolerates missing or extra keys)
model.load_state_dict(load_file("sinhala_tts_vits_model.safetensors"), strict=False)

# Fine-tune for 20-50 epochs with a lower learning rate
config.epochs = 50
config.lr = 0.00005  # Lower LR for fine-tuning
# VitsConfig also keeps separate generator/discriminator rates
config.lr_gen = 0.00005
config.lr_disc = 0.00005
# ... run training
```

### Voice Adaptation Tips

- Use high-quality audio (16 kHz minimum, 24 kHz preferred)
- Ensure a consistent recording environment (minimal background noise)
- Trim silence from the beginning and end of clips
- Normalize volume levels across all samples
- Provide at least 50 utterances per new speaker for best results

---

## 5. Dataset Preparation

### Audio Requirements

| Parameter | Value |
|-----------|-------|
| Sample Rate | 16 kHz (minimum) |
| Format | WAV (PCM 16-bit) |
| Duration | 1-15 seconds per clip |
| SNR | > 20 dB |
| Language | Sinhala (සිංහල) |

### Preprocessing Pipeline

```python
import csv
import os

import librosa
import numpy as np
import soundfile as sf


def preprocess_audio(input_path: str, output_path: str):
    """Resample, normalize, trim silence."""
    audio, sr = librosa.load(input_path, sr=16000, mono=True)

    # Trim leading/trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=25)

    # Normalize peak amplitude
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak * 0.95

    # Save
    sf.write(output_path, audio, 16000)


def prepare_metadata(audio_dir: str, output_csv: str):
    """Create training metadata from audio file directory."""
    # UTF-8 is set explicitly so Sinhala text survives on every platform
    with open(output_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['audio_file', 'text', 'speaker'])

        for speaker_dir in sorted(os.listdir(audio_dir)):
            speaker_path = os.path.join(audio_dir, speaker_dir)
            if not os.path.isdir(speaker_path):
                continue

            for wav in sorted(os.listdir(speaker_path)):
                if not wav.endswith('.wav'):
                    continue
                # Each clip needs a transcript with the same base name
                txt_path = os.path.join(speaker_path, os.path.splitext(wav)[0] + '.txt')
                if os.path.exists(txt_path):
                    with open(txt_path, 'r', encoding='utf-8') as tf:
                        text = tf.read().strip()
                    writer.writerow([os.path.join(speaker_dir, wav), text, speaker_dir])
```

---

## 6. Cloud Platform Setup

### Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github)

```python
# Colab setup
!pip install coqui-tts torch soundfile safetensors

# Clone model files
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deathlegionteam/sinhala-tts-vits", local_dir="./sinhala_tts")

# Load and synthesize
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
# ... (same as usage section)
```

**T4 GPU**: ~2 hours for fine-tuning (50 epochs, batch_size=4)

### Kaggle

```python
# Kaggle notebook setup
!pip install coqui-tts -q

# Download model
!wget https://huggingface.co/deathlegionteam/sinhala-tts-vits/resolve/main/config.json
!wget https://huggingface.co/deathlegionteam/sinhala-tts-vits/resolve/main/sinhala_tts_vits_model.safetensors
!wget https://huggingface.co/deathlegionteam/sinhala-tts-vits/resolve/main/speakers.json
```

**P100 GPU**: ~30 hrs/week free quota, good for full training runs

### Modal

```python
# modal_train.py
import modal

app = modal.App("sinhala-tts-training")

image = modal.Image.debian_slim().pip_install(
    "coqui-tts", "torch", "soundfile", "safetensors"
)


@app.function(gpu="A100", image=image, timeout=86400)
def train():
    # Training code here
    pass
```

**A100 80GB**: $20 free credit, ideal for full training from scratch

### RunPod

```bash
# Use template: PyTorch 2.0+ with CUDA 12.x
docker run -it --gpus all \
  -v "$(pwd)/data:/workspace/data" \
  runpod/pytorch:latest \
  bash -c "pip install coqui-tts && python train.py"
```

**RTX 4090**: ~$0.34/hr, good price/performance for fine-tuning

---

## 7. Export to SafeTensors

```python
"""export_safetensors.py: Export PyTorch checkpoint to SafeTensors"""
from pathlib import Path

import torch
from safetensors.torch import save_file


def export_safetensors(checkpoint_path: str, output_path: str):
    """Convert a .pth checkpoint to .safetensors format."""
    # weights_only=False unpickles arbitrary objects; only load trusted checkpoints
    ckpt = torch.load(checkpoint_path, map_location="cpu", weights_only=False)

    # Extract model state dict (handle trainer wrappers)
    if "model" in ckpt:
        # Coqui trainer checkpoints usually store the weights under "model"
        state_dict = ckpt["model"]
    elif "model_state_dict" in ckpt:
        state_dict = ckpt["model_state_dict"]
    elif "module" in ckpt:
        state_dict = ckpt["module"]
    else:
        state_dict = ckpt

    # Remove module prefix if present (from DataParallel/DDP)
    cleaned = {}
    for key, value in state_dict.items():
        new_key = key.replace("module.", "", 1) if key.startswith("module.") else key
        cleaned[new_key] = value.contiguous()

    save_file(cleaned, output_path)
    print(f"Exported {len(cleaned)} tensors to {output_path}")
    print(f"File size: {Path(output_path).stat().st_size / 1e6:.1f} MB")


if __name__ == "__main__":
    export_safetensors("vits_sinhala_epoch10.pth", "sinhala_tts_vits_model.safetensors")
```

---

## 8. Deployment Options

### Option A: HuggingFace Space (Gradio UI)

```bash
# Already deployed at:
# https://huggingface.co/spaces/deathlegionteam/sinhala-tts-demo

# To update:
python -c "
from huggingface_hub import HfApi
api = HfApi(token='YOUR_HF_TOKEN')
api.upload_file(
    path_or_fileobj='app.py',
    path_in_repo='app.py',
    repo_id='deathlegionteam/sinhala-tts-demo',
    repo_type='space'
)
"
```

### Option B: FastAPI + Docker

```bash
# Build and run locally
docker build -t sinhala-tts-server .
docker run -p 8081:8081 sinhala-tts-server

# Test API
curl -X POST http://localhost:8081/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "ආයුබෝවන්!", "speaker": "mettananda"}' \
  --output output.wav
```

### Option C: Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sinhala-tts
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sinhala-tts
  template:
    metadata:
      labels:
        app: sinhala-tts
    spec:
      containers:
        - name: server
          image: sinhala-tts-server:latest
          ports:
            - containerPort: 8081
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
---
apiVersion: v1
kind: Service
metadata:
  name: sinhala-tts-service
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8081
  selector:
    app: sinhala-tts
```

---

## 9. Troubleshooting

### Common Issues

| Problem | Cause | Solution |
|---------|-------|----------|
| `Infinity is not valid JSON` | `max_audio_len: Infinity` in config.json | Replace `Infinity` with `1e9` |
| `Configuration Parsing Warning` | Missing `library_name: coqui-tts` in YAML | Add to model card header |
| Model produces silence | Vocab mismatch (default Coqui uses 78 chars) | Set `characters` with full Sinhala vocab |
| `isin_mps_friendly` error | `transformers>=5.0` removed this attribute | Pin `transformers<5` |
| Audio too slow/fast | `length_scale` parameter | Adjust `model_args.length_scale` (1.0=normal) |
| Out of memory | Batch size too large | Reduce `batch_size` or use gradient accumulation |
| Speaker not found | Speaker name not in `speakers.json` | Verify speaker name matches exactly |
| CUDA out of memory | GPU VRAM insufficient | Use CPU: `model.to('cpu')` or reduce model size |

### Debugging Tips

```python
# Assumes `model`, `config`, `text`, and `speaker` are already defined
import numpy as np
from TTS.tts.utils.text.tokenizer import TTSTokenizer

# Check model output shape
outputs = model.synthesize(text, config=config, speaker=speaker)
print(outputs.keys())  # Available outputs
print(outputs['wav'].shape)  # Audio waveform shape
print(outputs['wav'].min(), outputs['wav'].max())  # Value range

# Verify tokenizer encoding
tokenizer, config = TTSTokenizer.init_from_config(config)
ids = tokenizer.encode(text)
print(f"Text IDs: {ids}")
print(f"Vocab size: {tokenizer.characters.num_chars}")

# Check if audio is all zeros
wav = outputs['wav']
if np.abs(wav).max() < 1e-6:
    print("WARNING: Audio is silent (all zeros)")
```

### Getting Help

- Open an issue on [HuggingFace Model Repo](https://huggingface.co/deathlegionteam/sinhala-tts-vits)
- Check [Coqui TTS Docs](https://github.com/coqui-ai/TTS)
- Join the [Coqui Discord](https://discord.gg/coqui-ai)

---

## License

Apache 2.0. See [LICENSE](./LICENSE) for details.

## Maintainer

**Death Legion Team**: [🤗 HuggingFace](https://huggingface.co/deathlegionteam)
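
As a follow-up to the `Infinity is not valid JSON` entry in the troubleshooting table, here is a minimal sketch of one way to apply that fix programmatically. It relies on Python's standard `json` module, which accepts the non-standard `Infinity` literal when reading. The helper names `replace_non_finite` and `sanitize_config_text` are illustrative assumptions, not part of this project or of Coqui TTS; the `1e9` default mirrors the table's suggestion.

```python
import json
import math


def replace_non_finite(value, replacement=1e9):
    """Recursively swap non-finite floats (inf, -inf, NaN) for a finite value."""
    if isinstance(value, float) and not math.isfinite(value):
        return replacement
    if isinstance(value, dict):
        return {k: replace_non_finite(v, replacement) for k, v in value.items()}
    if isinstance(value, list):
        return [replace_non_finite(v, replacement) for v in value]
    return value


def sanitize_config_text(raw: str, replacement: float = 1e9) -> str:
    """Return strict JSON text with non-finite numbers replaced."""
    data = json.loads(raw)  # Python's parser tolerates Infinity/NaN literals
    cleaned = replace_non_finite(data, replacement)
    # allow_nan=False raises ValueError if a non-finite value slipped through
    return json.dumps(cleaned, ensure_ascii=False, indent=2, allow_nan=False)


if __name__ == "__main__":
    sample = '{"audio": {"sample_rate": 16000}, "max_audio_len": Infinity}'
    print(sanitize_config_text(sample))
```

To apply it to a real file, read `config.json` as UTF-8 text, pass the contents through `sanitize_config_text`, and write the result back. `ensure_ascii=False` keeps the Sinhala characters readable in the output.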