Kala — Nepali TTS v0.2
The first open-source Nepali TTS model built on a hand-crafted grapheme-to-phoneme (G2P) frontend, with no eSpeak dependency.
Kala is a multi-speaker VITS model trained with the real_nepali G2P frontend:
a rule-and-lexicon system grounded in Khatiwada 2009 and tuned to mainstream
Kathmandu Nepali phonology. The ONNX model runs faster than real time on CPU
(real-time factor, RTF ≈ 0.02, i.e. about 50× faster than real time on a laptop).
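The real-time factor quoted above is synthesis time divided by the duration of the audio produced, so values below 1 are faster than real time. A minimal sketch of the arithmetic follows; the helper name `real_time_factor` and the timings are illustrative, not measurements from this model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 0.1 s of compute for 5 s of audio.
rtf = real_time_factor(0.1, 5.0)
speedup = 1.0 / rtf
print(f"RTF = {rtf:.2f}, about {speedup:.0f}x faster than real time")
```

To measure it yourself, time a synthesis call with `time.perf_counter()` and divide by the duration of the returned audio.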
▶ Try it live: ampixa-real-nepali-tts.hf.space
Why a new G2P?
eSpeak-ng's ne voice was designed for phoneme coverage, not phonological
accuracy. It maps Nepali affricates to alveolar labels (ts, tsh) that do
not match how Kathmandu speakers produce च and छ. It silently loses gemination
and does not handle Latin code-switching at all.
The real_nepali frontend:
| Feature | eSpeak ne | real_nepali |
|---|---|---|
| च / छ | ts / tsh (alveolar) | ch / chh (palatal) |
| Gemination | often lost | explicit : tokens |
| Schwa deletion | heuristic | rule-based, audited |
| Latin code-switch | undefined | letter-by-letter or override lexicon |
| Phone inventory | ~35 | 48 phones + geminated variants |
| Lexicon | none | 48 000-entry curated lexicon |
On the NepTTS-Bench minimal-pairs test (365 sentences), the frontend reaches 99.5 % minimal-pair contrast preservation against the reference IPA transcriptions.
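One way to read a figure like this: a minimal pair counts as preserved when the frontend produces different phone sequences for its two members. The sketch below assumes that definition, and the benchmark's actual scoring against reference IPA may differ. `contrast_preservation`, `toy_g2p`, and the romanised word pairs are hypothetical stand-ins, not part of real_nepali or NepTTS-Bench.

```python
from typing import Callable, Iterable, Sequence, Tuple

Phones = Sequence[str]


def contrast_preservation(
    pairs: Iterable[Tuple[str, str]],
    g2p: Callable[[str], Phones],
) -> float:
    """Fraction of minimal pairs whose two members get distinct phone sequences."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    preserved = sum(1 for a, b in pairs if tuple(g2p(a)) != tuple(g2p(b)))
    return preserved / len(pairs)


# Hypothetical toy frontend over romanised input: it keeps aspiration
# ("kh" vs "k") but collapses doubled letters, so it loses gemination.
def toy_g2p(word: str) -> Phones:
    phones = []
    i = 0
    while i < len(word):
        if word[i : i + 2] == "kh":
            phones.append("kh")
            i += 2
        else:
            if not phones or phones[-1] != word[i]:
                phones.append(word[i])
            i += 1
    return phones


# Illustrative pairs: one aspiration contrast, one gemination contrast.
pairs = [("kal", "khal"), ("pata", "patta")]
print(contrast_preservation(pairs, toy_g2p))  # 0.5
```

The toy frontend scores 0.5 because it merges the geminate pair, which is the kind of loss the metric is meant to expose.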
Available speakers
| Speaker | ID | Data type | Training hours |
|---|---|---|---|
| kala | 2 | human studio | 0.37 h |
| barsha | 1 | human recording | 1.62 h |
| slr143_F | 3 | corpus (OpenSLR-143) | 1.01 h |
| slr43_0546 | 4 | corpus (OpenSLR-43) | 0.62 h |
| slr43_2099 | 5 | corpus (OpenSLR-43) | 0.51 h |
Recommended speaker: kala, for both demo and production use.
The corpus speakers (slr143_F, slr43_*) have good prosody, but their recording
conditions vary; barsha is the second-best of the two human voices.
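When driving the ONNX model or piper directly, speakers are selected by numeric ID rather than by name. A small lookup transcribed from the table above can bridge the two. `SPEAKER_IDS` and `speaker_id` are illustrative names, not part of the kala_tts API.

```python
# Name -> ID pairs copied from the "Available speakers" table.
SPEAKER_IDS = {
    "barsha": 1,
    "kala": 2,
    "slr143_F": 3,
    "slr43_0546": 4,
    "slr43_2099": 5,
}


def speaker_id(name: str) -> int:
    """Return the numeric speaker ID for a voice name, with a helpful error."""
    try:
        return SPEAKER_IDS[name]
    except KeyError:
        known = ", ".join(sorted(SPEAKER_IDS))
        raise ValueError(f"unknown speaker {name!r}; expected one of: {known}") from None


print(speaker_id("kala"))  # 2
```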
Quick start (Python)
pip install kala-tts
import kala_tts
# Returns WAV bytes (16-bit PCM, 22050 Hz mono)
wav = kala_tts.synthesize("नमस्कार, कसरी हुनुहुन्छ?", speaker="kala")
# Write directly to a file
kala_tts.synthesize_to_file(
"नेपाल सुन्दर देश हो।",
"output.wav",
speaker="kala",
)
# List available speakers
print(kala_tts.list_speakers())
# ('kala', 'barsha', 'slr143_F', 'slr43_0546', 'slr43_2099')
# CLI
kala-tts "नमस्कार, कसरी हुनुहुन्छ?" --speaker kala -o out.wav
kala-tts --list-speakers
The first call downloads the ONNX model (~60 MB) from this repo and caches
it locally via huggingface_hub.
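Since `synthesize` is described above as returning WAV bytes, the standard-library `wave` module is enough to inspect the result. The helper below is a sketch (`describe_wav` and `silent_wav` are hypothetical names). To stay runnable without the model, it is demonstrated on one second of silence built in memory; in practice you would pass the bytes returned by `kala_tts.synthesize` instead.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read basic properties from in-memory WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def silent_wav(seconds: float, rate: int = 22050) -> bytes:
    """Build a mono 16-bit PCM WAV of silence, used here as a stand-in."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


info = describe_wav(silent_wav(1.0))
print(info)
```

The reported duration is also the denominator needed for a real-time-factor measurement.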
Manual inference (without the kala-tts package)
Download the ONNX and config files from this repo, then:
git clone https://github.com/Ampixa/nepa-newa-text-frontend
cd nepa-newa-text-frontend
pip install onnxruntime huggingface_hub numpy
python -m kala_tts "नमस्कार" -o out.wav
Or use piper directly:
pip install piper-tts
echo "नमस्कार, कसरी हुनुहुन्छ?" | \
piper --model real_nepali_v02_kala.fp32.onnx --speaker_id 2 --output_file out.wav
ONNX model details
| Property | Value |
|---|---|
| File | real_nepali_v02_kala.fp32.onnx |
| Format | FP32 ONNX (VITS encoder + decoder fused) |
| Sample rate | 22050 Hz |
| Inputs | input (int64 phone IDs), input_lengths, scales, sid |
| Speakers | 6 (use sid to select; the five public voices listed under Available speakers are IDs 1–5) |
| RTF on laptop CPU | ~0.02 (50× real-time) |
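The input names in the table match the layout used by Piper-style VITS exports. The sketch below builds a feed dictionary for `onnxruntime` under that assumption. The `scales` order (noise scale, length scale, noise width) and the default values follow Piper's convention and are not confirmed by this model card. The phone IDs are placeholders, since real IDs come from the real_nepali frontend and the model's config file.

```python
import numpy as np


def build_feed(phone_ids, speaker_id, noise_scale=0.667, length_scale=1.0, noise_w=0.8):
    """Pack one utterance into the four named inputs listed in the table."""
    ids = np.asarray(phone_ids, dtype=np.int64)[np.newaxis, :]  # shape (1, T)
    return {
        "input": ids,
        "input_lengths": np.array([ids.shape[1]], dtype=np.int64),
        # Assumed Piper ordering: [noise_scale, length_scale, noise_w].
        "scales": np.array([noise_scale, length_scale, noise_w], dtype=np.float32),
        "sid": np.array([speaker_id], dtype=np.int64),
    }


feed = build_feed([1, 5, 9, 2], speaker_id=2)  # placeholder phone IDs
for name, value in feed.items():
    print(name, value.dtype, value.shape)

# With onnxruntime installed and the model downloaded, inference would look like:
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "real_nepali_v02_kala.fp32.onnx", providers=["CPUExecutionProvider"]
#   )
#   audio = session.run(None, feed)[0]
```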
Training details
| Item | Value |
|---|---|
| Base checkpoint | piper-plus multilingual (302 MB) |
| Architecture | VITS (monotonic alignment search) |
| Total training rows | 4 338 |
| Total training hours | 8.61 h |
| Training epochs | 1 000 |
| Framework | piper-plus (patched for Nepali) |
| Hardware | NVIDIA L40S 46 GB |
Checkpoint SHA-256:
2b36b27f42e8549658676f953704573a31e2155fc95ec5d6407561e9fc4797fa
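To check a downloaded file against the digest above, a streaming hash with the standard library is sufficient. This is a generic sketch: `sha256_of_file` and `matches_expected` are illustrative helpers, and which file the published digest covers (the training checkpoint or the ONNX export) is not stated here, so the path you hash is your own assumption.

```python
import hashlib

EXPECTED_SHA256 = "2b36b27f42e8549658676f953704573a31e2155fc95ec5d6407561e9fc4797fa"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Case-insensitive comparison of the file's digest with the published one."""
    return sha256_of_file(path) == expected.lower()
```

Calling `matches_expected("path/to/downloaded/file")` returns `True` only when the file is byte-identical to the one that was hashed for this card.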
Training data
| Speaker | Source | Rows | Hours | License |
|---|---|---|---|---|
| algenib | Gemini-Flash synthetic (excluded from v0.2 public release) | 1 984 | 4.47 h | internal |
| barsha | Human recital | 808 | 1.62 h | CC-BY-SA-4.0 |
| kala | Human studio | 200 | 0.37 h | CC-BY-SA-4.0 |
| slr143_F | OpenSLR-143 | 566 | 1.01 h | CC-BY-SA-4.0 |
| slr43_0546 | OpenSLR-43 | 505 | 0.62 h | CC-BY-SA-4.0 |
| slr43_2099 | OpenSLR-43 | 275 | 0.51 h | CC-BY-SA-4.0 |
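The totals under Training details can be cross-checked from the per-speaker rows, which also gives the size of the public subset once the excluded algenib data is set aside. A small sketch using the figures from the table follows (names are illustrative). Because the per-speaker hours are rounded, their sum comes to 8.60 h, one digit off the reported 8.61 h.

```python
# (rows, hours) per speaker, copied from the training data table.
TRAINING_DATA = {
    "algenib": (1984, 4.47),  # synthetic, excluded from the public release
    "barsha": (808, 1.62),
    "kala": (200, 0.37),
    "slr143_F": (566, 1.01),
    "slr43_0546": (505, 0.62),
    "slr43_2099": (275, 0.51),
}


def totals(data, exclude=()):
    """Sum rows and hours over every speaker not named in `exclude`."""
    kept = [v for name, v in data.items() if name not in exclude]
    rows = sum(r for r, _ in kept)
    hours = round(sum(h for _, h in kept), 2)
    return rows, hours


print(totals(TRAINING_DATA))                       # (4338, 8.6)
print(totals(TRAINING_DATA, exclude={"algenib"}))  # (2354, 4.13)
```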
Known limitations
- Naturalness gap: The Kala voice was trained on only 200 utterances (0.37 h), so prosody can be flat on long sentences.
- Punctuation awareness: Periods, commas, and question marks are handled by deterministic pause insertion; the model does not learn intonation contours from punctuation tokens.
- OOV words: Out-of-vocabulary Devanagari words fall back to letter-by-letter rules. The 48 000-entry lexicon covers ~95% of common vocabulary.
- Numbers: Digits are read in Nepali word order; mixed Nepali/English numerals may produce unexpected output.
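To make the punctuation limitation in the list above concrete, deterministic pause insertion could look roughly like the sketch below: text is split at punctuation and each chunk is paired with a fixed pause. This is an illustration, not the frontend's code. The pause lengths, the inclusion of the danda (।), and the names `PAUSE_MS` and `split_with_pauses` are all assumptions.

```python
import re

# Hypothetical pause lengths in milliseconds; the real values are not documented.
PAUSE_MS = {",": 150, "।": 400, ".": 400, "?": 400}

# The capturing group makes re.split keep the punctuation marks.
_SPLIT = re.compile(r"([,।.?])")


def split_with_pauses(text: str):
    """Split text at punctuation and pair each chunk with a fixed pause."""
    parts = _SPLIT.split(text)
    out = []
    for i in range(0, len(parts), 2):
        chunk = parts[i].strip()
        mark = parts[i + 1] if i + 1 < len(parts) else ""
        if chunk:
            out.append((chunk, PAUSE_MS.get(mark, 0)))
    return out


print(split_with_pauses("नमस्कार, कसरी हुनुहुन्छ?"))
```

Because the pause depends only on the punctuation mark, a question and a statement receive the same treatment, which is why intonation contours are not captured.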
Citation
@misc{ampixa2026kala,
title = {Kala: CPU-native Nepali Text-to-Speech with a hand-crafted G2P},
author = {Ampixa},
year = {2026},
url = {https://huggingface.co/ampixa/real-nepali-v0.2-kala},
}
Phonological foundation: Khatiwada (2009), Nepali, Journal of the International Phonetic Association, 39(3), 373–380.
License
Model weights and code: CC-BY-SA 4.0
Training corpus (OpenSLR-143, OpenSLR-43): CC-BY-SA 4.0
G2P lexicon seed (google/language-resources ne/): CC-BY 4.0