| library_name: transformers | |
| tags: | |
| - dhivehi-tts | |
| license: mit | |
| datasets: | |
| - alakxender/dv_syn_speech_md | |
| language: | |
| - dv | |
| base_model: | |
| - facebook/mms-tts-div | |
| # Divehi TTS – Male Voice (VITS-based) | |
| This is a fine-tuned VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model for Divehi speech synthesis. It produces male-voice audio from Divehi text written in the Thaana script. The model was fine-tuned from Meta's MMS-TTS checkpoint `facebook/mms-tts-div` on `alakxender/dv_syn_speech_md`, a curated dataset of synthetic Divehi speech. | |
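Because the model expects Thaana-script input, a quick pre-check can catch Latin-script or mixed text before synthesis. The following is a minimal sketch, not part of the model or of Transformers: the helper name `thaana_ratio` is hypothetical, and the only fact it relies on is that the Unicode Thaana block spans U+0780 to U+07BF.

```python
THAANA_START, THAANA_END = 0x0780, 0x07BF  # Unicode Thaana block


def thaana_ratio(text):
    """Return the fraction of non-whitespace characters that are Thaana."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(THAANA_START <= ord(c) <= THAANA_END for c in chars)
    return hits / len(chars)


# Hypothetical usage: compare a Thaana word with a Latin-script word.
print(thaana_ratio("މޫސުން"))
print(thaana_ratio("weather"))
```

Punctuation such as the Arabic comma falls outside the Thaana block, so natural sentences will usually score slightly below 1.0. Any threshold for rejecting input is a judgment call.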
| ## Model Details | |
| | Field | Value | | |
| |----------------------|-------------------------------------------------| | |
| | **Model ID** | `alakxender/mms-tts-div-finetuned-md-m02` | | |
| | **Base Architecture**| MMS-TTS (VITS) | | |
| | **Language** | Divehi (dv) | | |
| | **Voice** | Male | | |
| | **Sampling Rate** | 16 kHz | | |
| | **Tokenizer** | `VitsTokenizer` | | |
| | **Inference Engine** | Transformers (🤗 Hugging Face) | | |
| ## Usage | |
| ```python | |
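| import torch | |
| import torchaudio | |
| from transformers import VitsModel, VitsTokenizer | |
| model_id = "alakxender/mms-tts-div-finetuned-md-m02" | |
| tokenizer = VitsTokenizer.from_pretrained(model_id) | |
| model = VitsModel.from_pretrained(model_id) | |
| text = "މޫސުން ވަރަށް ގޯސްވެ، ފުވައްމުލަކުން ފެށިގެން އައްޑުއަށް އޮރެންޖް އެލާޓް ނެރެފި" | |
| inputs = tokenizer(text, return_tensors="pt") | |
| # VITS synthesizes in a single forward pass, so call the model directly instead of generate(). | |
| with torch.no_grad(): | |
|     waveform = model(**inputs).waveform[0]  # 1-D tensor of audio samples | |
| # torchaudio.save expects (channels, num_samples); the sampling rate (16 kHz) is read from the config. | |
| torchaudio.save("output.wav", waveform.unsqueeze(0), model.config.sampling_rate) | |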
| ``` | |
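If `torchaudio` is not available, the waveform can also be written with the Python standard library alone. The sketch below assumes mono float samples in the range [-1.0, 1.0] and the 16 kHz rate listed in the model details. The helper name `write_pcm16_wav` is hypothetical, and the 440 Hz tone is only a stand-in signal, not model output.

```python
import math
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip so the int16 conversion cannot overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(frames) // 2 / sample_rate  # duration in seconds


# Stand-in signal: one second of a 440 Hz tone at 16 kHz (not model output).
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
duration = write_pcm16_wav("tone.wav", tone)
```

With a real synthesis result, the 1-D waveform tensor could be passed as `waveform.tolist()` together with `model.config.sampling_rate`.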
| ## Evaluation Summary | |
| - **Model**: `alakxender/mms-tts-div-finetuned-md-m02` | |
| - **Evaluated Samples**: 3 | |
| - **Avg Estimated MOS (UTMOS)**: `2.926` | |
| The average is read against the 5-point MOS scale below, which places it just under "Fair": | |
| ```json | |
| { | |
| "5": "Excellent (very natural)", | |
| "4": "Good (mostly natural)", | |
| "3": "Fair (some robotic quality)", | |
| "2": "Poor (noticeably unnatural)", | |
| "1": "Bad (unintelligible or very synthetic)" | |
| } | |
| ``` | |
| - **Artifacts**: | |
|   - 🎵 Audio: `outputs/audio/` | |
|   - 📊 Spectrograms: `outputs/spectrograms/` | |
|   - 📄 Report: `outputs/report.txt` | |
|   - 📈 MOS Scores: `outputs/mos_scores.txt` | |
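To make the summary concrete, the sketch below averages per-clip scores and maps the result onto the MOS scale from the table above. It rests on two assumptions: the card does not say how labels are assigned, so nearest-integer rounding is used here, and the per-clip scores are hypothetical placeholders, not the actual evaluation data. The function names are illustrative only.

```python
MOS_SCALE = {
    5: "Excellent (very natural)",
    4: "Good (mostly natural)",
    3: "Fair (some robotic quality)",
    2: "Poor (noticeably unnatural)",
    1: "Bad (unintelligible or very synthetic)",
}


def label_for(score):
    """Map a MOS value to the nearest scale label (assumed convention)."""
    nearest = int(score + 0.5)          # round half up
    nearest = max(1, min(5, nearest))   # clamp to the 1..5 scale
    return MOS_SCALE[nearest]


def summarize_mos(scores):
    """Return (mean rounded to 3 decimals, label) for a list of scores."""
    mean = sum(scores) / len(scores)
    return round(mean, 3), label_for(mean)


# Hypothetical per-clip scores, used only to exercise the helper.
placeholder_scores = [3.1, 2.7, 3.0]
print(summarize_mos(placeholder_scores))

# Label for the average reported in this card.
print(label_for(2.926))
```

Under this rounding convention the reported average of 2.926 lands in the "Fair" band. A floor-based convention would place it in "Poor" instead, so the label depends on the chosen rule.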
| ## Acknowledgements | |
| - [Meta MMS-TTS](https://github.com/facebookresearch/fairseq/tree/main/examples/mms) | |
| - [Tarepan's SpeechMOS](https://github.com/Tarepan/SpeechMOS) | |
| - [Hugging Face 🤗 Transformers](https://huggingface.co/transformers/) | |