Kala — Nepali TTS v0.2

The first open-source Nepali TTS model built on a hand-crafted G2P — no eSpeak.

Kala is a multi-speaker VITS model trained with the real_nepali G2P frontend: a rule-and-lexicon system grounded in Khatiwada 2009 and tuned to mainstream Kathmandu Nepali phonology. The ONNX model runs faster than real time on CPU: RTF ≈ 0.02 on a laptop, i.e. about 50× real time.
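
Real-time factor (RTF) is the synthesis time divided by the duration of the audio produced, so RTF ≈ 0.02 means one second of speech takes about 0.02 s to generate. Below is a minimal sketch of measuring it with the standard library only. The helper names (`wav_duration_seconds`, `real_time_factor`, `make_silence`) are illustrative and not part of kala_tts, and the silent WAV merely stands in for real synthesizer output.

```python
import io
import wave


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV byte string, read from its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def make_silence(seconds: float, sample_rate: int = 22050) -> bytes:
    """Build a silent 16-bit mono WAV, standing in for synthesizer output."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buf.getvalue()


if __name__ == "__main__":
    audio = make_silence(2.0)
    # Pretend synthesis took 0.04 s; in practice, time the call with
    # time.perf_counter() around the synthesis function.
    rtf = real_time_factor(0.04, wav_duration_seconds(audio))
    print(f"RTF = {rtf:.2f}, speed-up = {1 / rtf:.0f}x real time")
```

An RTF below 1.0 means synthesis is faster than playback; the speed-up is simply 1 / RTF.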

Try it live: ampixa-real-nepali-tts.hf.space


Why a new G2P?

eSpeak-ng's ne voice was designed for phoneme coverage, not phonological accuracy. It maps Nepali affricates to alveolar labels (ts, tsh) that do not match how Kathmandu speakers produce च and छ. It silently loses gemination and does not handle Latin code-switching at all.

The real_nepali frontend:

| Feature | eSpeak ne | real_nepali |
|---|---|---|
| च / छ | ts / tsh (alveolar) | ch / chh (palatal) |
| Gemination | often lost | explicit : tokens |
| Schwa deletion | heuristic | rule-based, audited |
| Latin code-switch | undefined | letter-by-letter or override lexicon |
| Phone inventory | ~35 | 48 phones + geminated variants |
| Lexicon | none | 48 000-entry curated lexicon |

On the NepTTS-Bench minimal-pairs test (365 sentences), the frontend reaches 99.5 % minimal-pair contrast preservation against the reference IPA transcriptions.
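
One plausible reading of "minimal-pair contrast preservation" is: of all input pairs that should sound different, the share for which the frontend emits different phone sequences. The sketch below illustrates that reading with toy data. The scoring function, the two toy frontends, and the phone strings are assumptions made for illustration; this is not the NepTTS-Bench scorer or the real_nepali API.

```python
from typing import Callable, Iterable, List, Tuple

Frontend = Callable[[str], List[str]]


def contrast_preservation(pairs: Iterable[Tuple[str, str]], g2p: Frontend) -> float:
    """Fraction of pairs whose two members map to different phone sequences."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    kept = sum(1 for a, b in pairs if g2p(a) != g2p(b))
    return kept / len(pairs)


def keeps_length(text: str) -> List[str]:
    """Toy frontend: passes space-separated phones through unchanged."""
    return text.split()


def drops_length(text: str) -> List[str]:
    """Toy frontend: discards the ':' gemination mark on every phone."""
    return [phone.rstrip(":") for phone in text.split()]


# Hypothetical pairs: one differs only in gemination, one only in aspiration.
PAIRS = [
    ("p a t a", "p a t: a"),
    ("k a l", "kh a l"),
]

if __name__ == "__main__":
    print(contrast_preservation(PAIRS, keeps_length))
    print(contrast_preservation(PAIRS, drops_length))
```

A frontend that drops gemination collapses the first pair and so scores lower, which is the kind of loss the comparison table attributes to eSpeak.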


Available speakers

| Speaker | ID | Data type | Training hours |
|---|---|---|---|
| kala | 2 | human studio | 0.37 h |
| barsha | 1 | human recording | 1.62 h |
| slr143_F | 3 | corpus (OpenSLR-143) | 1.01 h |
| slr43_0546 | 4 | corpus (OpenSLR-43) | 0.62 h |
| slr43_2099 | 5 | corpus (OpenSLR-43) | 0.51 h |

Recommended speaker: kala for demo and production use. The corpus speakers (slr143_F, slr43_*) have good prosody but recording conditions vary; barsha is the second-best human voice.
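
Tools that take a numeric speaker ID rather than a name (for example the `sid` input of the ONNX model) need the IDs from the table above. A small lookup such as the following keeps the mapping in one place; `SPEAKER_IDS` and `speaker_id` are illustrative names, not part of kala_tts.

```python
# Speaker name -> numeric ID, copied from the "Available speakers" table.
SPEAKER_IDS = {
    "barsha": 1,
    "kala": 2,
    "slr143_F": 3,
    "slr43_0546": 4,
    "slr43_2099": 5,
}


def speaker_id(name: str = "kala") -> int:
    """Return the numeric ID for a speaker name, defaulting to kala."""
    try:
        return SPEAKER_IDS[name]
    except KeyError:
        known = ", ".join(sorted(SPEAKER_IDS))
        raise ValueError(f"unknown speaker {name!r}; choose one of: {known}") from None
```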


Quick start (Python)

pip install kala-tts

import kala_tts

# Returns WAV bytes (16-bit PCM, 22050 Hz mono)
wav = kala_tts.synthesize("नमस्कार, कसरी हुनुहुन्छ?", speaker="kala")

# Write directly to a file
kala_tts.synthesize_to_file(
    "नेपाल सुन्दर देश हो।",
    "output.wav",
    speaker="kala",
)

# List available speakers
print(kala_tts.list_speakers())
# ('kala', 'barsha', 'slr143_F', 'slr43_0546', 'slr43_2099')

# CLI
kala-tts "नमस्कार, कसरी हुनुहुन्छ?" --speaker kala -o out.wav
kala-tts --list-speakers

The first call downloads the ONNX model (~60 MB) from this repo and caches it locally via huggingface_hub.
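
The download-once-then-cache behaviour is what `hf_hub_download` from huggingface_hub provides: it fetches a file on first use and returns the cached local path afterwards. A minimal sketch of doing the same by hand follows. The repo id is taken from the citation URL and is an assumption about which repo "this repo" refers to; the file name comes from the ONNX model details; `fetch_model` is an illustrative name, not a kala_tts function.

```python
REPO_ID = "ampixa/real-nepali-v0.2-kala"       # assumption: taken from the citation URL
MODEL_FILE = "real_nepali_v02_kala.fp32.onnx"  # file name from the ONNX model details


def fetch_model(repo_id: str = REPO_ID, filename: str = MODEL_FILE) -> str:
    """Return a local path to the model, downloading it only if not cached."""
    # Imported here so this module loads even when huggingface_hub is absent.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


# Usage (needs network access on the first call):
#   path = fetch_model()
```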


Manual inference (without installing the kala-tts package)

Download the ONNX and config files from this repo, then:

git clone https://github.com/Ampixa/nepa-newa-text-frontend
cd nepa-newa-text-frontend
pip install onnxruntime huggingface_hub numpy
python -m kala_tts "नमस्कार" -o out.wav

Or use piper directly:

pip install piper-tts
echo "नमस्कार, कसरी हुनुहुन्छ?" | \
  piper --model real_nepali_v02_kala.fp32.onnx --speaker_id 2 --output_file out.wav

ONNX model details

| Property | Value |
|---|---|
| File | real_nepali_v02_kala.fp32.onnx |
| Format | FP32 ONNX (VITS encoder + decoder fused) |
| Sample rate | 22050 Hz |
| Inputs | input (int64 phone IDs), input_lengths, scales, sid |
| Speakers | 6 (use sid to select) |
| RTF on laptop CPU | ~0.02 (50× real-time) |
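
The four inputs listed above can be fed to onnxruntime directly. The sketch below shows one way to build the input dictionary. The tensor shapes, the order of the three `scales` values, and their defaults follow the usual Piper VITS export and are assumptions here, so check them against the config file shipped with the model. The phone IDs must come from the G2P frontend and the config's phone-to-ID map; `build_feed` and `synthesize_raw` are illustrative names.

```python
import numpy as np


def build_feed(phone_ids, speaker_id, noise_scale=0.667, length_scale=1.0, noise_w=0.8):
    """Build the input dict for the model's four named inputs.

    Shapes and the order of `scales` are assumed from the Piper VITS export.
    """
    ids = np.asarray(phone_ids, dtype=np.int64)[np.newaxis, :]  # shape (1, T)
    return {
        "input": ids,
        "input_lengths": np.array([ids.shape[1]], dtype=np.int64),
        "scales": np.array([noise_scale, length_scale, noise_w], dtype=np.float32),
        "sid": np.array([speaker_id], dtype=np.int64),
    }


def synthesize_raw(model_path, phone_ids, speaker_id):
    """Run the ONNX model on CPU and return the audio samples as a 1-D array."""
    import onnxruntime as ort  # imported lazily; only needed for inference

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    audio = session.run(None, build_feed(phone_ids, speaker_id))[0]
    return np.squeeze(audio)
```

Raising `length_scale` above 1.0 slows speech down in Piper-style models; whether this export behaves the same way is worth verifying by ear.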

Training details

| Item | Value |
|---|---|
| Base checkpoint | piper-plus multilingual (302 MB) |
| Architecture | VITS + monotonic attention |
| Total training rows | 4 338 |
| Total training hours | 8.61 h |
| Training epochs | 1 000 |
| Framework | piper-plus (patched for Nepali) |
| Hardware | NVIDIA L40S 46 GB |

Checkpoint SHA-256:

2b36b27f42e8549658676f953704573a31e2155fc95ec5d6407561e9fc4797fa
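
A downloaded file can be checked against this digest with the standard library. The sketch below hashes a file in chunks so that large checkpoints need not fit in memory. The source does not say which file the digest belongs to, so the path to check is left to the caller; `sha256_of_file` and `matches_expected` are illustrative names.

```python
import hashlib

EXPECTED_SHA256 = "2b36b27f42e8549658676f953704573a31e2155fc95ec5d6407561e9fc4797fa"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```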

Training data

| Speaker | Source | Rows | Hours | License |
|---|---|---|---|---|
| algenib | Gemini-Flash synthetic (excluded from v0.2 public release) | 1 984 | 4.47 h | internal |
| barsha | Human recital | 808 | 1.62 h | CC-BY-SA-4.0 |
| kala | Human studio | 200 | 0.37 h | CC-BY-SA-4.0 |
| slr143_F | OpenSLR-143 | 566 | 1.01 h | CC-BY-SA-4.0 |
| slr43_0546 | OpenSLR-43 | 505 | 0.62 h | CC-BY-SA-4.0 |
| slr43_2099 | OpenSLR-43 | 275 | 0.51 h | CC-BY-SA-4.0 |
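
The per-speaker figures can be cross-checked against the totals in the training details. The rows sum to 4 338; the rounded hours sum to 8.60 h, which agrees with the stated 8.61 h up to rounding of the individual entries. The short sketch below does the arithmetic and also totals the five speakers other than algenib. Treating algenib as the only non-public entry is a reading of the "excluded from v0.2 public release" note, and the names used are illustrative.

```python
# speaker: (rows, hours, part of the public release)
TRAINING_DATA = {
    "algenib": (1984, 4.47, False),
    "barsha": (808, 1.62, True),
    "kala": (200, 0.37, True),
    "slr143_F": (566, 1.01, True),
    "slr43_0546": (505, 0.62, True),
    "slr43_2099": (275, 0.51, True),
}


def totals(public_only=False):
    """Return (total rows, total hours rounded to 2 decimals)."""
    selected = [
        (rows, hours)
        for rows, hours, public in TRAINING_DATA.values()
        if public or not public_only
    ]
    total_rows = sum(rows for rows, _ in selected)
    total_hours = round(sum(hours for _, hours in selected), 2)
    return total_rows, total_hours


if __name__ == "__main__":
    print("all speakers:", totals())
    print("public speakers only:", totals(public_only=True))
```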

Known limitations

  • Naturalness gap: The Kala voice was trained on only 200 utterances, so prosody can be flat on long sentences.
  • Punctuation awareness: Periods, commas, and question marks are handled via deterministic pause insertion — the model does not learn intonation contours from punctuation tokens.
  • OOV words: Unknown Devanagari words fall back to letter-by-letter rules. The 48 000-entry lexicon covers ~95% of common vocabulary.
  • Numbers: Digits are read in Nepali word order; mixed Nepali/English numerals may produce unexpected output.
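
To illustrate what deterministic pause insertion can look like, the sketch below splits text at punctuation and attaches a fixed pause to each chunk. It is not the kala_tts implementation: the function name, the punctuation set, and the pause durations are all hypothetical values chosen for the example.

```python
import re

# Hypothetical pause lengths in milliseconds; not the values used by kala_tts.
PAUSE_MS = {",": 150, ".": 400, "।": 400, "?": 400}

# The capturing group makes re.split keep each punctuation mark in the result.
_SPLIT = re.compile(r"([,.।?])")


def split_with_pauses(text):
    """Return (chunk, pause_ms) tuples; the pause follows the chunk."""
    parts = _SPLIT.split(text)  # [chunk, punct, chunk, punct, ..., chunk]
    result = []
    for i in range(0, len(parts), 2):
        chunk = parts[i].strip()
        punct = parts[i + 1] if i + 1 < len(parts) else None
        if chunk:
            result.append((chunk, PAUSE_MS.get(punct, 0)))
    return result


if __name__ == "__main__":
    for chunk, pause in split_with_pauses("नमस्कार, कसरी हुनुहुन्छ?"):
        print(f"{chunk!r} then {pause} ms of silence")
```

Because the pause depends only on the punctuation mark, a question and a statement of the same length get the same timing, which is consistent with the flat-intonation limitation noted in the list.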

Citation

@misc{ampixa2026kala,
  title  = {Kala: CPU-native Nepali Text-to-Speech with a hand-crafted G2P},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/real-nepali-v0.2-kala},
}

Phonological foundation: Khatiwada (2009), Nepali, Journal of the International Phonetic Association, 39(3), 373–380.


License

  • Model weights and code: CC-BY-SA 4.0
  • Training corpus (OpenSLR-143, OpenSLR-43): CC-BY-SA 4.0
  • G2P lexicon seed (google/language-resources ne/): CC-BY 4.0
