---
language: si
license: apache-2.0
library_name: coqui-tts
pipeline_tag: text-to-speech
inference: false
tags:
- text-to-speech
- sinhala
- tts
- vits
- coqui-tts
- speech-synthesis
datasets:
- sinhala-tts
metrics:
- mos
---

# 🗣️ Sinhala TTS VITS 🇱🇰

**Sinhala Text-to-Speech**: a [Coqui TTS](https://github.com/coqui-ai/TTS) VITS model that generates natural Sinhala speech from text, with **16 distinct voices** to choose from.

## 🎯 Model Details

| Attribute | Value |
|-----------|-------|
| **Architecture** | VITS (Variational Inference Text-to-Speech) |
| **Language** | 🇱🇰 Sinhala (සිංහල) |
| **Speakers** | 16 voices |
| **Sample Rate** | 16 kHz |
| **Parameters** | ~30M |
| **Vocab** | 97 characters (74 Sinhala Unicode + 19 punctuation + 4 special tokens) |
| **Framework** | [Coqui TTS](https://github.com/coqui-ai/TTS) 0.27.x |
| **License** | Apache 2.0 |
| **Model Format** | SafeTensors (.safetensors) |

## 🗣️ Available Speakers

| ID | Speaker Name | Description |
|----|-------------|-------------|
| 0 | **mettananda** | Male voice 1 |
| 1 | **oshadi** | Female voice 1 |
| 2 | **pn_sin_01** | Voice 3 |
| 3 | **sin_01** | Voice 4 |
| 4 | **sin_2241** | Voice 5 |
| 5 | **sin_2282** | Voice 6 |
| 6 | **sin_3531** | Voice 7 |
| 7 | **sin_3688** | Voice 8 |
| 8 | **sin_3976** | Voice 9 |
| 9 | **sin_4191** | Voice 10 |
| 10 | **sin_4499** | Voice 11 |
| 11 | **sin_5681** | Voice 12 |
| 12 | **sin_6314** | Voice 13 |
| 13 | **sin_6897** | Voice 14 |
| 14 | **sin_7183** | Voice 15 |
| 15 | **sin_9228** | Voice 16 |

## 🚀 Usage

### Option 1: Coqui TTS (Recommended)

```python
import soundfile as sf
import torch
from safetensors.torch import load_file
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text import TTSTokenizer
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.audio import AudioProcessor

# Load config
config = VitsConfig()
config.load_json("config.json")

# Initialize components
ap = AudioProcessor.init_from_config(config)
tokenizer, new_config = TTSTokenizer.init_from_config(config)
speaker_manager = SpeakerManager()
speaker_manager.load_ids_from_file("speakers.json")

# Create and load model
model = Vits(new_config, ap, tokenizer, speaker_manager)
state_dict = load_file("sinhala_tts_vits_model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()

# Synthesize
text = "ආයුබෝවන්! ඔබට කොහොමද?"
outputs = model.synthesize(text, config=new_config, speaker="mettananda")

# Save audio at the model's 16 kHz sample rate
sf.write("output.wav", outputs["wav"], 16000)
```

### Option 2: REST API (with included server.py)

```bash
# Start the server
python server.py

# Generate speech
curl -X POST http://localhost:8081/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "ආයුබෝවන්!",
    "speaker": "mettananda",
    "emotion": "neutral"
  }' \
  --output output.wav

# Health check
curl http://localhost:8081/health

# List speakers
curl http://localhost:8081/speakers
```

### Option 3: HuggingFace Inference API

> ⚠️ This model uses Coqui TTS (not Transformers) and cannot be used via the standard HF Inference API. Use Coqui TTS directly or the included REST API server.

### Option 4: Docker Deployment

```bash
docker build -t sinhala-tts-server .
docker run -p 8081:8081 sinhala-tts-server
```

## 🛠️ Development Platforms

| Platform | GPU | Cost | Best For |
|----------|-----|------|----------|
| [![Kaggle](https://img.shields.io/badge/Kaggle-20BEFF?logo=kaggle&logoColor=white)](https://kaggle.com) | P100/T4 | Free (~30 hrs/week) | Quick experiments |
| [![Colab](https://img.shields.io/badge/Colab-F9AB00?logo=googlecolab&logoColor=white)](https://colab.research.google.com) | T4/A100 | Free / $10/mo Pro | Training runs |
| [![Modal](https://img.shields.io/badge/Modal-1D2C3E?logo=modal&logoColor=white)](https://modal.com) | A100 80GB | $20 free credit | Full training |
| [![RunPod](https://img.shields.io/badge/RunPod-6C1EE7?logo=runpod&logoColor=white)](https://runpod.io) | RTX 4090/A100 | $0.34–$2.00/hr | Production |

## 📦 Files

| File | Description | Size |
|------|-------------|------|
| `sinhala_tts_vits_model.safetensors` | Model weights (SafeTensors) | 316 MB |
| `config.json` | Model configuration | 8 KB |
| `speakers.json` | Speaker ID mapping | 300 B |
| `server.py` | FastAPI REST inference server | 6 KB |
| `Dockerfile` | Docker build for production | 2 KB |
| `DEVELOPER_GUIDE.md` | Training & development guide | 15 KB |

## 🎓 Training & Fine-Tuning

For detailed instructions, see [DEVELOPER_GUIDE.md](./DEVELOPER_GUIDE.md), which covers:

- **Setup**: Environment configuration and dependency installation
- **Training from scratch**: Full training pipeline with the Sinhala dataset
- **Fine-tuning**: Adapting the model to new voices or domains
- **Dataset preparation**: Preprocessing Sinhala audio data
- **Export to SafeTensors**: Converting PyTorch checkpoints to SafeTensors format
- **Cloud GPU training**: Step-by-step guides for Kaggle, Colab, and Modal

## 🌐 Deployment Options

| Method | Description | Best For |
|--------|-------------|----------|
| **HuggingFace Space** | Gradio web UI (live demo) | Quick testing |
| **FastAPI Server** | REST API with Docker | Production APIs |
| **Local Python** | Direct model loading | Development |
| **Kubernetes** | Docker container in K8s | Scalable deployment |

## ⚠️ Limitations

- **Audio quality**: Trained on a limited dataset (~200 samples × 16 speakers), so quality may vary
- **Inference speed**: CPU inference is slower than GPU inference; a GPU is recommended for production
- **Emotion control**: Basic emotion prefixes are supported, but their effects are subtle
- **Proper nouns**: The model may struggle with non-Sinhala words or names
- **Out-of-vocabulary characters**: Input is limited to the 93 non-special vocabulary characters (74 Sinhala Unicode + 19 punctuation)

## 📝 License

This model is released under the **Apache 2.0 License**.

## 🙏 Maintainer

**Death Legion Team**: [🤗 HuggingFace](https://huggingface.co/deathlegionteam)

---
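The `/tts` endpoint shown in the REST API option can also be called from Python. The following is a minimal sketch using only the standard library. The helper names `build_tts_request` and `synthesize_to_file` are hypothetical and not part of `server.py`. The sketch assumes the server returns raw WAV bytes in the response body, as the `curl --output output.wav` usage suggests, and that it accepts the same `text`, `speaker`, and `emotion` JSON fields.

```python
import json
import urllib.request

# Base URL taken from the curl examples; adjust if the server runs elsewhere.
BASE_URL = "http://localhost:8081"


def build_tts_request(text, speaker="mettananda", emotion="neutral", base_url=BASE_URL):
    """Build a POST request for the /tts endpoint (hypothetical helper)."""
    payload = {"text": text, "speaker": speaker, "emotion": emotion}
    return urllib.request.Request(
        f"{base_url}/tts",
        # Keep Sinhala characters as UTF-8 instead of \u escapes.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path="output.wav", **kwargs):
    """Send the request and write the response body to out_path.

    Assumption: the response body is the WAV audio itself.
    """
    request = build_tts_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as f:
        f.write(audio_bytes)
    return out_path
```

With the server running locally, `synthesize_to_file("ආයුබෝවන්!", speaker="oshadi")` would write the audio to `output.wav`. Building the request separately from sending it keeps the payload easy to inspect without a live server.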
