Kala — Nepali TTS v0.2

The first open-source Nepali TTS model built on a hand-crafted G2P — no eSpeak.

Kala is a multi-speaker VITS model trained with the real_nepali G2P frontend: a rule-and-lexicon system grounded in Khatiwada 2009 and tuned to mainstream Kathmandu Nepali phonology. The ONNX model runs on CPU in real time (RTF ≈ 0.02 — 50× faster than real time on a laptop).
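For readers unfamiliar with the metric, the relationship between RTF and the "50×" figure can be sketched in a few lines. The function names and the timing numbers below are illustrative assumptions, not measurements from this repo.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# Illustrative numbers only: 0.1 s of compute for 5 s of audio.
rtf = real_time_factor(0.1, 5.0)
print(f"RTF = {rtf:.2f}, speed-up = {speedup(rtf):.0f}x")
```

An RTF below 1.0 means synthesis finishes before the audio would finish playing; 0.02 corresponds to roughly 50 seconds of speech per second of CPU time.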

▶ Try it live: ampixa-real-nepali-tts.hf.space

Why a new G2P?

eSpeak-ng's ne voice was designed for phoneme coverage, not phonological accuracy. It maps the Nepali affricates to alveolar labels (ts, tsh), which do not match how Kathmandu speakers produce च and छ. It often drops gemination without warning, and it has no defined behavior for Latin code-switching.

How the real_nepali frontend compares:

Feature	eSpeak `ne`	real_nepali
च / छ	`ts` / `tsh` (alveolar)	`ch` / `chh` (palatal)
Gemination	often lost	explicit `:` tokens
Schwa deletion	heuristic	rule-based, audited
Latin code-switch	undefined	letter-by-letter or override lexicon
Phone inventory	~35	48 phones + geminated variants
Lexicon	none	48 000-entry curated lexicon
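To make the table concrete, here is a toy sketch of a lexicon-first G2P with a rule fallback. It is not the real_nepali code: the tables, the function name `toy_g2p`, the vowel labels, and the choice to place the `:` token right after the geminated phone are all assumptions made for illustration.

```python
VIRAMA = "\u094d"  # halant sign: suppresses the inherent vowel

# Tiny illustrative tables; the real inventory has 48 phones.
CONSONANTS = {
    "\u0915": "k",    # क
    "\u091a": "ch",   # च (palatal label, as in the table above)
    "\u091b": "chh",  # छ
    "\u092a": "p",    # प
    "\u092c": "b",    # ब
}
MATRAS = {"\u093e": "aa", "\u093f": "i", "\u0941": "u"}  # ा ि ु

# Lexicon entries win over rules, e.g. to drop a word-final schwa.
LEXICON = {"\u0915\u092a": ["k", "a", "p"]}  # कप


def toy_g2p(word: str) -> list[str]:
    """Return a phone list: lexicon lookup first, letter rules otherwise."""
    if word in LEXICON:
        return list(LEXICON[word])
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError(f"character {ch!r} is outside this toy inventory")
        phone = CONSONANTS[ch]
        if word[i + 1:i + 3] == VIRAMA + ch:
            # Consonant + virama + same consonant: emit one phone plus ':'.
            phones += [phone, ":"]
            i += 3
        else:
            phones.append(phone)
            i += 1
        nxt = word[i:i + 1]
        if nxt in MATRAS:
            phones.append(MATRAS[nxt])
            i += 1
        elif nxt == VIRAMA:
            i += 1  # cluster: no vowel after this consonant
        else:
            phones.append("a")  # inherent vowel; no schwa deletion in this toy
    return phones


print(toy_g2p("\u092c\u091a\u094d\u091a\u093e"))  # बच्चा
```

The point of the sketch is the control flow: the lexicon answers first, gemination is kept as an explicit token instead of being collapsed, and unknown words still get a rule-based result.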

On the NepTTS-Bench minimal-pairs test (365 sentences), the frontend reaches 99.5 % minimal-pair contrast preservation against the reference IPA transcriptions.

Available speakers

Speaker	ID	Data type	Training hours
`kala`	2	human studio	0.37 h
`barsha`	1	human recording	1.62 h
`slr143_F`	3	corpus (OpenSLR-143)	1.01 h
`slr43_0546`	4	corpus (OpenSLR-43)	0.62 h
`slr43_2099`	5	corpus (OpenSLR-43)	0.51 h

Recommended speaker: kala (ID 2) for demo and production use. The corpus speakers (slr143_F, slr43_*) have good prosody, but their recording conditions vary. barsha is the other human voice and the second choice.
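Because the IDs do not follow the order of the table (kala is 2, not 1), a small lookup helps when driving the ONNX model by `sid`. The values are copied from the table above; the helper name `speaker_id` is hypothetical and not part of the kala_tts API.

```python
# Name -> sid mapping, copied from the "Available speakers" table.
SPEAKER_IDS = {
    "barsha": 1,
    "kala": 2,
    "slr143_F": 3,
    "slr43_0546": 4,
    "slr43_2099": 5,
}


def speaker_id(name: str) -> int:
    """Translate a speaker name into the integer the model expects as `sid`."""
    try:
        return SPEAKER_IDS[name]
    except KeyError:
        valid = ", ".join(sorted(SPEAKER_IDS))
        raise ValueError(f"unknown speaker {name!r}; choose one of: {valid}") from None


print(speaker_id("kala"))
```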

Quick start (Python)

pip install kala-tts

import kala_tts

# Returns WAV bytes (16-bit PCM, 22050 Hz mono)
wav = kala_tts.synthesize("नमस्कार, कसरी हुनुहुन्छ?", speaker="kala")

# Write directly to a file
kala_tts.synthesize_to_file(
    "नेपाल सुन्दर देश हो।",
    "output.wav",
    speaker="kala",
)

# List available speakers
print(kala_tts.list_speakers())
# ('kala', 'barsha', 'slr143_F', 'slr43_0546', 'slr43_2099')

# CLI (run these in a shell, not in Python)
kala-tts "नमस्कार, कसरी हुनुहुन्छ?" --speaker kala -o out.wav
kala-tts --list-speakers

The first call downloads the ONNX model (~60 MB) from this repo and caches it locally via huggingface_hub.
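To confirm that the returned bytes match the documented format (16-bit PCM, 22050 Hz, mono), the standard-library `wave` module is enough. The sketch below uses a generated silent clip as a stand-in so it runs without the model; `describe_wav` and `silent_wav` are hypothetical helpers, and in practice you would pass the result of `kala_tts.synthesize(...)` to `describe_wav`.

```python
import io
import wave


def describe_wav(wav_bytes: bytes) -> dict:
    """Read the header of in-memory WAV data and report its basic format."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "bits": wf.getsampwidth() * 8,
            "sample_rate": rate,
            "seconds": wf.getnframes() / rate,
        }


def silent_wav(seconds: float, rate: int = 22050) -> bytes:
    """Build a silent 16-bit mono WAV, used here only as stand-in data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


info = describe_wav(silent_wav(0.5))
print(info)
```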

Manual inference (without the kala-tts package)

Download the ONNX and config files from this repo, then:

git clone https://github.com/Ampixa/nepa-newa-text-frontend
cd nepa-newa-text-frontend
pip install onnxruntime huggingface_hub numpy
python -m kala_tts "नमस्कार" -o out.wav

Or use piper directly:

pip install piper-tts
echo "नमस्कार, कसरी हुनुहुन्छ?" | \
  piper --model real_nepali_v02_kala.fp32.onnx --speaker 2 --output_file out.wav

ONNX model details

Property	Value
File	`real_nepali_v02_kala.fp32.onnx`
Format	FP32 ONNX (VITS encoder + decoder fused)
Sample rate	22050 Hz
Inputs	`input` (int64 phone IDs), `input_lengths`, `scales`, `sid`
Speakers	6 in the checkpoint; the 5 public voices use IDs 1 to 5 (select with `sid`)
RTF on laptop CPU	~0.02 (50× real-time)
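A minimal sketch of how the four inputs might be assembled follows. The shapes, the order of the three `scales` values (noise scale, length scale, noise width), and their defaults follow the convention commonly used by Piper-exported VITS models; they are assumptions here, not something this card confirms. The phone IDs are placeholders, and `build_inputs` is a hypothetical helper.

```python
def build_inputs(phone_ids, sid, noise_scale=0.667, length_scale=1.0, noise_w=0.8):
    """Assemble the feed for the ONNX graph as plain nested lists.

    Convert each value with numpy.asarray(..., dtype=...) before passing the
    dict to an onnxruntime InferenceSession: int64 for `input`,
    `input_lengths` and `sid`, float32 for `scales`.
    """
    phone_ids = list(phone_ids)
    if not phone_ids:
        raise ValueError("phone_ids must not be empty")
    return {
        "input": [phone_ids],                            # assumed shape [1, T]
        "input_lengths": [len(phone_ids)],               # assumed shape [1]
        "scales": [noise_scale, length_scale, noise_w],  # assumed shape [3]
        "sid": [sid],                                    # assumed shape [1]
    }


# Placeholder IDs; real ones come from the G2P frontend and the model config.
feed = build_inputs([1, 5, 9], sid=2)
print(feed)
```

Raising `length_scale` above 1.0 conventionally slows speech down in models of this family; whether this checkpoint behaves the same way should be checked against its config.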

Training details

Item	Value
Base checkpoint	piper-plus multilingual (302 MB)
Architecture	VITS (monotonic alignment search)
Total training rows	4 338
Total training hours	8.61 h
Training epochs	1 000
Framework	piper-plus (patched for Nepali)
Hardware	NVIDIA L40S 46 GB

Checkpoint SHA-256:

2b36b27f42e8549658676f953704573a31e2155fc95ec5d6407561e9fc4797fa
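A downloaded file can be checked against this digest with the standard library. The card does not say which file name the digest belongs to beyond "checkpoint", so the path you pass is up to you; `sha256_of_file` and `matches_expected` are hypothetical helper names.

```python
import hashlib

# Digest published above.
EXPECTED_SHA256 = "2b36b27f42e8549658676f953704573a31e2155fc95ec5d6407561e9fc4797fa"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True when the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

Call `matches_expected("path/to/checkpoint")` after downloading; a `False` result means the file is corrupted or is not the file the digest was computed from.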

Training data

Speaker	Source	Rows	Hours	License
`algenib`	Gemini-Flash synthetic (excluded from v0.2 public release)	1 984	4.47 h	internal
`barsha`	Human recital	808	1.62 h	CC-BY-SA-4.0
`kala`	Human studio	200	0.37 h	CC-BY-SA-4.0
`slr143_F`	OpenSLR-143	566	1.01 h	CC-BY-SA-4.0
`slr43_0546`	OpenSLR-43	505	0.62 h	CC-BY-SA-4.0
`slr43_2099`	OpenSLR-43	275	0.51 h	CC-BY-SA-4.0
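The totals can be recomputed from the table. The rows sum to the 4 338 reported under Training details. The per-speaker hours sum to 8.60 h, 0.01 h below the reported 8.61 h, presumably because each entry is rounded. The "public" flag below reflects the note that algenib is excluded from the v0.2 public release.

```python
# (rows, hours, in_public_release), copied from the table above.
TRAINING_DATA = {
    "algenib": (1984, 4.47, False),
    "barsha": (808, 1.62, True),
    "kala": (200, 0.37, True),
    "slr143_F": (566, 1.01, True),
    "slr43_0546": (505, 0.62, True),
    "slr43_2099": (275, 0.51, True),
}


def totals(public_only=False):
    """Sum rows and hours, optionally restricted to the publicly released speakers."""
    selected = [
        (rows, hours)
        for rows, hours, public in TRAINING_DATA.values()
        if public or not public_only
    ]
    total_rows = sum(rows for rows, _ in selected)
    total_hours = round(sum(hours for _, hours in selected), 2)
    return total_rows, total_hours


print(totals())
print(totals(public_only=True))
```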

Known limitations

Naturalness gap: The Kala voice was trained on only 200 utterances (0.37 h), so prosody can be flat on long sentences.
Punctuation awareness: Periods, commas, and question marks are handled via deterministic pause insertion — the model does not learn intonation contours from punctuation tokens.
OOV words: Unknown Devanagari words fall back to letter-by-letter rules. The 48 000-entry lexicon covers ~95% of common vocabulary.
Numbers: Digits are read in Nepali word order; mixed Nepali/English numerals may produce unexpected output.
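The punctuation handling described above can be pictured as a pre-processing pass that splits text at punctuation marks and emits fixed pauses. This is a sketch of the general idea only: the pause durations, the set of marks, and the `insert_pauses` function are hypothetical and are not taken from the Kala code.

```python
import re

# Hypothetical pause lengths in milliseconds; the real values are not documented here.
PAUSE_MS = {",": 150, "\u0964": 400, ".": 400, "?": 400}  # "\u0964" is the danda

# The capture group keeps each punctuation mark in the split result.
_SPLIT = re.compile(r"([,\u0964.?])")


def insert_pauses(text: str) -> list:
    """Turn text into ('text', chunk) and ('pause', milliseconds) segments."""
    segments = []
    for piece in _SPLIT.split(text):
        piece = piece.strip()
        if not piece:
            continue
        if piece in PAUSE_MS:
            segments.append(("pause", PAUSE_MS[piece]))
        else:
            segments.append(("text", piece))
    return segments


print(insert_pauses("a, b?"))
```

Because the pause is chosen by a lookup and not learned, a question mark and a period yield the same silence in this sketch, which is one way to see why intonation contours do not follow punctuation.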

Citation

@misc{ampixa2026kala,
  title  = {Kala: CPU-native Nepali Text-to-Speech with a hand-crafted G2P},
  author = {Ampixa},
  year   = {2026},
  url    = {https://huggingface.co/ampixa/real-nepali-v0.2-kala},
}

Phonological foundation: Khatiwada (2009), Nepali, Journal of the International Phonetic Association, 39(3), 373–380.

License

Model weights and code: CC-BY-SA 4.0
Training corpus (OpenSLR-143, OpenSLR-43): CC-BY-SA 4.0
G2P lexicon seed (google/language-resources ne/): CC-BY 4.0
