Hindi SLM Tokenizer — hindi_slm_tokenizer_v001
A production-ready Hindi subword tokenizer trained on 5 GB of verified Hindi text from the AI4Bharat Sangraha dataset. Built as the first component of an offline Hindi Small Language Model (SLM) designed to run on consumer hardware and small edge devices — no cloud, no internet required.
Model Details
| Property | Value |
|---|---|
| Algorithm | Unigram Language Model (as implemented in SentencePiece; used by T5, ALBERT, XLNet) |
| Vocabulary size | 32,000 |
| Normalizer | NFKC |
| Pre-tokenizer | Metaspace (▁) — SentencePiece-compatible |
| Decoder | Metaspace |
| Max piece length | 24 characters |
| Model max length | 2048 tokens |
| Training corpus | 5 GB — AI4Bharat Sangraha verified/hin |
| Library | HuggingFace tokenizers (Rust-backed) |
| Version | hindi_slm_tokenizer_v001 |
| Training date | 2026-05-18 |
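To make the NFKC row concrete, here is a small sketch of what that normalization form does to text before tokenization. It uses Python's standard `unicodedata` module rather than the tokenizer itself, and the helper name `nfkc` and the sample strings are illustrative assumptions, not taken from the training pipeline.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (the form listed in the table above)."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth digits are compatibility characters and fold to ASCII digits.
fullwidth_year = "２０２６"

# The precomposed nukta letter U+0958 is a Unicode composition exclusion,
# so it normalizes to the base letter U+0915 followed by the nukta U+093C.
precomposed_qa = "\u0958"

print(nfkc(fullwidth_year))
print([hex(ord(ch)) for ch in nfkc(precomposed_qa)])
```

In practice this means visually identical inputs typed with different code point sequences reach the vocabulary lookup in one canonical form.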
Special Tokens
These token IDs are permanently frozen — the SLM embedding matrix is indexed by these IDs.
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<unk>` | 1 | Unknown |
| `<s>` | 2 | Beginning of sequence (BOS) |
| `</s>` | 3 | End of sequence (EOS) |
| `<\|system\|>` | 4 | Chat: system turn |
| `<\|user\|>` | 5 | Chat: user turn |
| `<\|assistant\|>` | 6 | Chat: assistant turn |
| `<\|end\|>` | 7 | Chat: turn end marker |
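Because the IDs are frozen, a guard that compares a loaded tokenizer against the table can catch accidental drift. The following is a minimal sketch: `FROZEN_SPECIAL_TOKENS` and `find_id_mismatches` are hypothetical names, and the lookup is passed in as a plain callable so the example runs without downloading anything.

```python
# Frozen IDs copied from the Special Tokens table.
FROZEN_SPECIAL_TOKENS = {
    "<pad>": 0,
    "<unk>": 1,
    "<s>": 2,
    "</s>": 3,
    "<|system|>": 4,
    "<|user|>": 5,
    "<|assistant|>": 6,
    "<|end|>": 7,
}


def find_id_mismatches(token_to_id):
    """Return {token: (expected_id, actual_id)} for every special token that moved."""
    mismatches = {}
    for token, expected in FROZEN_SPECIAL_TOKENS.items():
        actual = token_to_id(token)
        if actual != expected:
            mismatches[token] = (expected, actual)
    return mismatches


# Stand-in lookup for illustration only. With the real tokenizer, one could
# pass tokenizer.token_to_id (tokenizers) or tokenizer.convert_tokens_to_ids
# (transformers) instead.
stand_in_vocab = dict(FROZEN_SPECIAL_TOKENS)
print(find_id_mismatches(stand_in_vocab.get))
```

An empty result means every special token still sits at its frozen ID.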
Vocabulary Composition
- About 80% of the vocabulary (25,599 of 32,000 tokens) contains Devanagari characters
- The remaining 20% covers numerals, Latin script (English words, abbreviations), punctuation, and special tokens
- Average of 4.788 characters per token on the validation sentences; most Hindi words tokenize to a single token
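The Devanagari share can be recomputed from any vocabulary list. The sketch below assumes "contains Devanagari" means at least one code point in the Unicode Devanagari block (U+0900 to U+097F); the card does not state the exact rule, and `toy_vocab` is made-up data for illustration.

```python
def contains_devanagari(token: str) -> bool:
    """True if any character falls in the Devanagari block (assumed definition)."""
    return any("\u0900" <= ch <= "\u097f" for ch in token)


def devanagari_share(vocab) -> float:
    """Fraction of tokens in `vocab` that contain at least one Devanagari character."""
    tokens = list(vocab)
    hits = sum(contains_devanagari(token) for token in tokens)
    return hits / len(tokens)


# Toy vocabulary for illustration; the real one has 32,000 entries.
toy_vocab = ["▁नमस्ते", "▁आप", "ing", "2026", "▁है"]
print(devanagari_share(toy_vocab))
```

With the real tokenizer, the same function could be applied to the keys of the vocabulary mapping; 25,599 of 32,000 tokens corresponds to a share of roughly 0.79997.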
Validation Results
Evaluated on 20 manually curated Hindi sentences covering conversational text, named entities, mixed Hindi-English (code-switching), morphologically complex words, classical literature references, and numerals.
| Metric | Value | Threshold | Status |
|---|---|---|---|
| `unk_rate` | 0.000 | < 0.001 | ✅ Pass |
| `chars_per_token` | 4.788 | > 3.0 | ✅ Pass |
| `tokens_per_word` | 1.128 | < 2.5 | ✅ Pass |
| `roundtrip_success_rate` | 1.000 | > 0.99 | ✅ Pass |
| `devanagari_char_coverage` | 1.000 | > 0.995 | ✅ Pass |
| `special_token_split_failures` | 0 | == 0 | ✅ Pass |
At 1.128 tokens per word, most Hindi words in the validation set map to a single token; on average only about 13 extra tokens are needed per 100 words, so fragmentation is low.
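For readers who want to reproduce metrics of this kind, here is one plausible way to compute four of them. The formulas are assumptions (the card does not define them): `unk_rate` as unknown tokens over all tokens, `chars_per_token` as input characters over tokens, `tokens_per_word` as tokens over whitespace-separated words, and round-trip success as exact string equality after decode. `toy_encode` and `toy_decode` are stand-ins so the sketch runs offline.

```python
def evaluate(sentences, encode, decode, unk_token="<unk>"):
    """Compute simple tokenizer quality metrics over a list of sentences."""
    total_tokens = total_chars = total_words = unk_count = roundtrips = 0
    for sentence in sentences:
        tokens = encode(sentence)
        total_tokens += len(tokens)
        total_chars += len(sentence)
        total_words += len(sentence.split())
        unk_count += tokens.count(unk_token)
        roundtrips += decode(tokens) == sentence
    return {
        "unk_rate": unk_count / total_tokens,
        "chars_per_token": total_chars / total_tokens,
        "tokens_per_word": total_tokens / total_words,
        "roundtrip_success_rate": roundtrips / len(sentences),
    }


def toy_encode(text):
    """Stand-in tokenizer: one Metaspace-style piece per whitespace word."""
    return ["▁" + word for word in text.split()]


def toy_decode(tokens):
    """Inverse of toy_encode: turn the ▁ markers back into spaces."""
    return "".join(tokens).replace("▁", " ").strip()


report = evaluate(["आप कैसे हैं", "भारत की राजधानी"], toy_encode, toy_decode)
print(report)
```

To evaluate the real tokenizer, `encode` could wrap `tokenizer.encode(text).tokens` and `decode` could convert the pieces back to IDs before calling `tokenizer.decode`.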
Variant Comparison
Three vocabulary sizes were trained and compared. 32k was selected for the best balance of coverage and embedding matrix size.
| Variant | Vocab | unk_rate | chars/token | tokens/word | Selected |
|---|---|---|---|---|---|
| `hindi_unigram_24k_v001` | 24,000 | 0.000 | 4.660 | 1.159 | |
| `hindi_unigram_32k_v001` | 32,000 | 0.000 | 4.788 | 1.128 | ✅ |
| `hindi_unigram_48k_v001` | 48,000 | 0.000 | 4.904 | 1.102 | |
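The embedding-size side of this trade-off is simple arithmetic: the input embedding matrix holds one row per vocabulary entry. The sketch below assumes a hidden size of 512, which is purely illustrative; the card does not state the SLM's dimensions.

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Number of parameters in a vocab_size x hidden_dim embedding matrix."""
    return vocab_size * hidden_dim


HIDDEN_DIM = 512  # hypothetical hidden size, not taken from the model card

for vocab_size in (24_000, 32_000, 48_000):
    params = embedding_params(vocab_size, HIDDEN_DIM)
    print(f"{vocab_size:>6} vocab -> {params / 1e6:.1f}M embedding parameters")
```

Under that assumption, moving from 32k to 48k grows the embedding matrix by 50% while tokens per word improves only from 1.128 to 1.102 (about 2.3%), which is consistent with choosing 32k for a small model.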
How to Use
With HuggingFace transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")
# Encode
inputs = tokenizer("हिंदी भाषा में प्रशिक्षण", return_tensors="pt")
print(inputs["input_ids"])
# Decode
text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
print(text)
With HuggingFace tokenizers (lower level)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")
encoding = tokenizer.encode("नमस्ते, आप कैसे हैं?")
print(encoding.tokens) # ['▁नमस्ते', ',', '▁आप', '▁कैसे', '▁हैं', '?']
print(encoding.ids) # [892, 15, 423, 1201, 88, 18]
# Perfect round-trip
decoded = tokenizer.decode(encoding.ids)
print(decoded) # "नमस्ते, आप कैसे हैं?"
Chat format
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")
# Chat template example
messages = [
{"role": "system", "content": "आप एक सहायक हैं।"},
{"role": "user", "content": "भारत की राजधानी क्या है?"},
]
# Manual formatting using special tokens
system_id = tokenizer.convert_tokens_to_ids("<|system|>") # 4
user_id = tokenizer.convert_tokens_to_ids("<|user|>") # 5
asst_id = tokenizer.convert_tokens_to_ids("<|assistant|>") # 6
end_id = tokenizer.convert_tokens_to_ids("<|end|>") # 7
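The snippet above looks up the chat token IDs but stops short of assembling a prompt. One possible layout is sketched here: each turn is written as role token, content, then `<|end|>`, with a trailing `<|assistant|>` to cue generation. This layout and the `format_chat` helper are assumptions for illustration; the card does not specify the exact template the SLM will be trained on.

```python
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def format_chat(messages, add_generation_prompt=True):
    """Join chat turns into one string using the special tokens (assumed layout)."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]
        parts.append(f"{role_token}{message['content']}{END_TOKEN}")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(ROLE_TOKENS["assistant"])
    return "".join(parts)


messages = [
    {"role": "system", "content": "आप एक सहायक हैं।"},
    {"role": "user", "content": "भारत की राजधानी क्या है?"},
]
prompt = format_chat(messages)
print(prompt)
```

The resulting string can then be passed to the tokenizer; since the validation table reports zero special-token split failures, each marker should encode to its single frozen ID.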
Training Data
| Property | Detail |
|---|---|
| Dataset | AI4Bharat Sangraha |
| Subset | verified/hin — human-curated Hindi, highest quality tier |
| Corpus size used | 5 GB |
| Sampling seed | 42 (reproducible) |
| Min chars per document | 30 |
| Max chars per document | 5,000 |
| Min Devanagari ratio | 60% |
The verified subset of Sangraha was curated by AI4Bharat from human-verified sources with quality checks, making it one of the highest-quality publicly available Hindi corpora.
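The document-level filters in the table can be expressed as a single predicate. This is a sketch under stated assumptions: the Devanagari ratio is measured over non-whitespace characters using the Unicode block U+0900 to U+097F, and documents outside the length bounds are dropped rather than truncated. Neither detail is confirmed by the card, and the function names are hypothetical.

```python
def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (assumed definition)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in chars) / len(chars)


def keep_document(text, min_chars=30, max_chars=5000, min_ratio=0.60):
    """Apply the length and script filters listed in the Training Data table."""
    if not (min_chars <= len(text) <= max_chars):
        return False
    return devanagari_ratio(text) >= min_ratio


samples = [
    "भारत एक विशाल देश है और इसकी राजधानी नई दिल्ली है।",
    "नमस्ते",
    "This is an English sentence that is long enough to pass.",
]
print([keep_document(sample) for sample in samples])
```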
Algorithm: Unigram Language Model
The Unigram algorithm (Kudo, 2018) starts with a large candidate vocabulary (~300k substrings) and iteratively prunes it to the target size using an EM (Expectation-Maximization) procedure. At inference, the Viterbi algorithm finds the maximum-probability segmentation of input text.
This is the Unigram algorithm implemented in SentencePiece and used by models such as T5, ALBERT, and XLNet; LLaMA and Mistral use BPE-based tokenizers instead. Unigram can suit Hindi's rich morphology because it scores all candidate segmentations probabilistically and selects the most likely one, rather than applying a fixed sequence of greedy merges.
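The inference step described above can be illustrated with a small dynamic program. This is a teaching sketch of Viterbi segmentation over a toy vocabulary with made-up probabilities, not the Rust implementation in the `tokenizers` library, and it omits unknown-token handling.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-probability split of `text` into pieces from `log_probs`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in text[:i]
    best[0] = 0.0
    max_len = max(len(piece) for piece in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy unigram model: piece -> log-probability (values are invented).
toy_model = {
    "▁नमस्ते": math.log(0.2),
    "▁नम": math.log(0.1),
    "स्ते": math.log(0.1),
}
print(viterbi_segment("▁नमस्ते", toy_model))
```

Here the whole-word piece wins because its probability (0.2) exceeds the product of the two sub-pieces (0.1 × 0.1); removing it from the vocabulary makes the two-piece split the best available segmentation.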
Integrity
SHA-256 of tokenizer.json:
fbe21c642a4a13030833be48733c1c6b78244e4c0bc077516422b22e7f046cd9
Verify:
import hashlib
from pathlib import Path
expected = "fbe21c642a4a13030833be48733c1c6b78244e4c0bc077516422b22e7f046cd9"
actual = hashlib.sha256(Path("tokenizer.json").read_bytes()).hexdigest()
assert actual == expected, "tokenizer.json has been modified"
Project Context
This tokenizer is one of three pipeline stages in a Hindi SLM built entirely on a personal laptop:
| Component | Status |
|---|---|
| ✅ Data ingestion (5 GB Sangraha Hindi corpus) | Complete |
| ✅ Tokenizer (hindi_slm_tokenizer_v001, 32k vocab) | This repo |
| 🔜 SLM pretraining | Coming soon |
Goal: A fully offline Hindi language model that runs on small devices — phones, edge hardware, consumer laptops — with no internet dependency. Data stays on the device.
Citation
@misc{hindi-slm-tokenizer-v001,
author = {Vaibhav Maurya},
title = {Hindi SLM Tokenizer v001 — Unigram 32k},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/vaibhavmaurya/hindi-slm-tokenizer-v001}
}
License
Apache 2.0