metadata
library_name: transformers
tags:
- dhivehi-tts
license: mit
datasets:
- alakxender/dv_syn_speech_md
language:
- dv
base_model:
- facebook/mms-tts-div
Divehi TTS – Male Voice (VITS-based)
This is a fine-tuned VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model for Divehi speech synthesis. It produces male-voice audio from Divehi text written in Thaana script. The model was fine-tuned from Meta's MMS-TTS Divehi checkpoint (facebook/mms-tts-div) on a curated dataset of synthetic Divehi speech (alakxender/dv_syn_speech_md).
Model Details
| Field | Value |
|---|---|
| Model ID | alakxender/mms-tts-div-finetuned-md-m01 |
| Base Architecture | MMS-TTS (VITS) |
| Language | Divehi (dv) |
| Voice | Male |
| Sampling Rate | 16 kHz |
| Tokenizer | VitsTokenizer |
| Inference Engine | Transformers (🤗 Hugging Face) |
Usage
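import torch
import torchaudio
from transformers import VitsModel, VitsTokenizer

tokenizer = VitsTokenizer.from_pretrained("alakxender/mms-tts-div-finetuned-md-m01")
model = VitsModel.from_pretrained("alakxender/mms-tts-div-finetuned-md-m01")
model.eval()

text = "މޫސުން ވަރަށް ގޯސްވެ، ފުވައްމުލަކުން ފެށިގެން އައްޑުއަށް އޮރެންޖް އެލާޓް ނެރެފި"
inputs = tokenizer(text, return_tensors="pt")

# VitsModel synthesizes audio in its forward pass; waveform has shape (batch, samples).
with torch.no_grad():
    waveform = model(**inputs).waveform[0]

# torchaudio.save expects (channels, samples); the rate comes from the model config (16 kHz here).
torchaudio.save("output.wav", waveform.unsqueeze(0), model.config.sampling_rate)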
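If torchaudio is not available, the synthesized samples can also be written with Python's standard library. The following is a minimal sketch, not part of the original card: `save_wav_pcm16` is a hypothetical helper name, and it assumes mono float samples in the range [-1.0, 1.0] at the 16 kHz rate listed under Model Details.

```python
import struct
import wave


def save_wav_pcm16(samples, path, sample_rate=16000):
    """Write mono float samples to a 16-bit PCM WAV file.

    `samples` is a list of floats, assumed to lie in [-1.0, 1.0];
    values outside that range are clipped. Returns the duration in seconds.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))

    return len(samples) / sample_rate
```

With a 1-D waveform tensor from the model, a call such as `save_wav_pcm16(waveform.tolist(), "output.wav")` should produce a file equivalent in format to the torchaudio output, since the sample rate defaults to 16000.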
Evaluation Summary
- Model: alakxender/mms-tts-div-finetuned-md-m01
- Evaluated Samples: 3
- Avg Estimated MOS (UTMOS): 3.228

| MOS | Meaning |
|---|---|
| 5 | Excellent (very natural) |
| 4 | Good (mostly natural) |
| 3 | Fair (some robotic quality) |
| 2 | Poor (noticeably unnatural) |
| 1 | Bad (unintelligible or very synthetic) |

- Artifacts:
  - 🎵 Audio: outputs/audio/
  - 📊 Spectrograms: outputs/spectrograms/
  - 📄 Report: outputs/report.txt
  - 📈 MOS Scores: outputs/mos_scores.txt
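To make the reported average easier to read against the MOS scale, here is a small sketch of how per-sample scores might be averaged and mapped to a label. The helper names `mean_mos` and `mos_label` are hypothetical, and rounding to the nearest whole level is an assumption; the card does not say how the evaluation script assigns labels, and the per-sample scores are not published here.

```python
# Labels copied from the MOS scale in the evaluation summary.
MOS_SCALE = {
    5: "Excellent (very natural)",
    4: "Good (mostly natural)",
    3: "Fair (some robotic quality)",
    2: "Poor (noticeably unnatural)",
    1: "Bad (unintelligible or very synthetic)",
}


def mean_mos(scores):
    """Arithmetic mean of per-sample estimated MOS values."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


def mos_label(score):
    """Map a MOS value to the nearest scale level (assumed rounding rule)."""
    nearest = int(score + 0.5)          # round half up
    nearest = max(1, min(5, nearest))   # clamp to the 1..5 scale
    return MOS_SCALE[nearest]
```

Under this rounding assumption, `mos_label(3.228)` returns "Fair (some robotic quality)", which places the reported average in the "Fair" band.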