Hindi SLM Tokenizer — `hindi_slm_tokenizer_v001`

A production-ready Hindi subword tokenizer trained on 5 GB of verified Hindi text from the AI4Bharat Sangraha dataset. Built as the first component of an offline Hindi Small Language Model (SLM) designed to run on consumer hardware and small edge devices — no cloud, no internet required.

Model Details

Property	Value
Algorithm	Unigram Language Model (as implemented in SentencePiece; also used by T5, ALBERT, and XLNet)
Vocabulary size	32,000
Normalizer	NFKC
Pre-tokenizer	Metaspace (`▁`) — SentencePiece-compatible
Decoder	Metaspace
Max piece length	24 characters
Model max length	2048 tokens
Training corpus	5 GB — AI4Bharat Sangraha `verified/hin`
Library	HuggingFace `tokenizers` (Rust-backed)
Version	`hindi_slm_tokenizer_v001`
Training date	2026-05-18

Special Tokens

These token IDs are permanently frozen — the SLM embedding matrix is indexed by these IDs.

Token	ID	Purpose
`<pad>`	0	Padding
`<unk>`	1	Unknown
`<s>`	2	Beginning of sequence (BOS)
`</s>`	3	End of sequence (EOS)
`<|system|>`	4	Chat: system turn
`<|user|>`	5	Chat: user turn
`<|assistant|>`	6	Chat: assistant turn
`<|end|>`	7	Chat: turn end marker
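
Because the IDs are frozen, it can be worth asserting them whenever the tokenizer is loaded. The following is a minimal sketch, assuming a lookup callable that maps a token string to its ID (for example `Tokenizer.token_to_id` from the `tokenizers` library). The helper name `check_frozen_ids` and the stand-in dictionary are illustrative and not part of this repo.

```python
from typing import Callable, Dict, Optional

# Frozen IDs copied from the Special Tokens table.
FROZEN_IDS: Dict[str, int] = {
    "<pad>": 0,
    "<unk>": 1,
    "<s>": 2,
    "</s>": 3,
    "<|system|>": 4,
    "<|user|>": 5,
    "<|assistant|>": 6,
    "<|end|>": 7,
}


def check_frozen_ids(
    token_to_id: Callable[[str], Optional[int]],
) -> Dict[str, Optional[int]]:
    """Return {token: actual_id} for every special token whose ID differs from the table."""
    mismatches: Dict[str, Optional[int]] = {}
    for token, expected in FROZEN_IDS.items():
        actual = token_to_id(token)
        if actual != expected:
            mismatches[token] = actual
    return mismatches


# Stand-in lookup for illustration; with the real tokenizer one would pass
# tokenizer.token_to_id instead (assumption: loaded via the tokenizers library).
stand_in_vocab = dict(FROZEN_IDS)
assert check_frozen_ids(stand_in_vocab.get) == {}
```

An empty result means every special token sits at its documented ID; any entry in the returned dictionary points at a token that would index the wrong embedding row.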

Vocabulary Composition

About 80% of the vocabulary (25,599 of 32,000 tokens) contains Devanagari characters.
The remaining 20% covers numerals, Latin script (English words, abbreviations), punctuation, and special tokens.
Average token length: 4.788 characters (the `chars_per_token` figure from the validation set below); most Hindi words tokenize to a single token.
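
The Devanagari share can be recomputed from any vocabulary list. The sketch below is one way to do it, assuming that "contains Devanagari" means at least one character in the Unicode block U+0900 to U+097F; the function names and the toy vocabulary are illustrative, and the real figure (25,599 / 32,000 ≈ 0.79997) would come from the vocabulary in `tokenizer.json`.

```python
from typing import Iterable


def contains_devanagari(token: str) -> bool:
    """True if any character falls in the Devanagari Unicode block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in token)


def devanagari_share(vocab: Iterable[str]) -> float:
    """Fraction of vocabulary entries that contain at least one Devanagari character."""
    tokens = list(vocab)
    if not tokens:
        return 0.0
    return sum(contains_devanagari(t) for t in tokens) / len(tokens)


# Toy vocabulary for illustration only (not the real 32k vocabulary).
toy_vocab = ["▁नमस्ते", "▁आप", "▁हैं", "▁the", "2024", ",", "<pad>", "▁भारत"]
print(devanagari_share(toy_vocab))  # 4 of the 8 entries contain Devanagari -> 0.5
```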

Validation Results

Evaluated on 20 manually curated Hindi sentences covering conversational text, named entities, mixed Hindi-English (code-switching), morphologically complex words, classical literature references, and numerals.

Metric	Value	Threshold	Status
`unk_rate`	0.000	< 0.001	✅ Pass
`chars_per_token`	4.788	> 3.0	✅ Pass
`tokens_per_word`	1.128	< 2.5	✅ Pass
`roundtrip_success_rate`	1.000	> 0.99	✅ Pass
`devanagari_char_coverage`	1.000	> 0.995	✅ Pass
`special_token_split_failures`	0	== 0	✅ Pass

At 1.128 tokens per word, most Hindi words in the validation set map to a single token, with little fragmentation.
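
The metric names above are not formally defined in this card, so the sketch below uses plausible definitions as assumptions: `unk_rate` is unknown tokens over all tokens, `chars_per_token` is Unicode code points (spaces included) over tokens, `tokens_per_word` is tokens over whitespace-separated words, and `roundtrip_success_rate` is the share of sentences recovered exactly by decode(encode(s)). The toy whitespace tokenizer stands in for the real one.

```python
from typing import Callable, Dict, List, Sequence


def tokenizer_metrics(
    sentences: Sequence[str],
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level tokenizer metrics under the assumed definitions."""
    if not sentences:
        raise ValueError("need at least one sentence")
    total_tokens = total_chars = total_words = unk_tokens = roundtrips = 0
    for sentence in sentences:
        tokens = encode(sentence)
        total_tokens += len(tokens)
        total_chars += len(sentence)          # code points, not grapheme clusters
        total_words += len(sentence.split())  # whitespace-separated words
        unk_tokens += sum(t == unk_token for t in tokens)
        roundtrips += decode(tokens) == sentence
    return {
        "unk_rate": unk_tokens / total_tokens,
        "chars_per_token": total_chars / total_tokens,
        "tokens_per_word": total_tokens / total_words,
        "roundtrip_success_rate": roundtrips / len(sentences),
    }


# Toy Metaspace-like whitespace tokenizer (illustration only).
def toy_encode(text: str) -> List[str]:
    return ["▁" + word for word in text.split(" ")]


def toy_decode(tokens: List[str]) -> str:
    return "".join(tokens).replace("▁", " ")[1:]  # drop the leading space


print(tokenizer_metrics(["आप कैसे हैं", "भारत"], toy_encode, toy_decode))
```

With the real tokenizer, `encode` would return `encoding.tokens` and `decode` would map IDs back to text; the thresholds in the table can then be checked against the returned dictionary.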

Variant Comparison

Three vocabulary sizes were trained and compared. 32k was selected for the best balance of coverage and embedding matrix size.

Variant	Vocab	unk_rate	chars/token	tokens/word	Selected
`hindi_unigram_24k_v001`	24,000	0.000	4.660	1.159
`hindi_unigram_32k_v001`	32,000	0.000	4.788	1.128	✅
`hindi_unigram_48k_v001`	48,000	0.000	4.904	1.102
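
The embedding-size side of this trade-off is simple arithmetic: the embedding table holds `vocab_size × d_model` parameters. The sketch below works it through for the three variants, assuming a hidden size of 512; that value is hypothetical, since this card does not state the SLM's dimensions.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of parameters in a vocab_size x d_model embedding table."""
    return vocab_size * d_model


D_MODEL = 512  # hypothetical hidden size; not specified in this card

for name, vocab in [("24k", 24_000), ("32k", 32_000), ("48k", 48_000)]:
    params = embedding_params(vocab, D_MODEL)
    mib_fp16 = params * 2 / 2**20  # 2 bytes per parameter in fp16
    print(f"{name}: {params:,} parameters, {mib_fp16:.2f} MiB in fp16")
```

Under that assumption the 32k table holds 16,384,000 parameters, against 12,288,000 for 24k and 24,576,000 for 48k. Moving from 32k to 48k therefore adds 50% to the embedding table while tokens per word improve only from 1.128 to 1.102.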

How to Use

With HuggingFace `transformers`

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")

# Encode
inputs = tokenizer("हिंदी भाषा में प्रशिक्षण", return_tensors="pt")
print(inputs["input_ids"])

# Decode
text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
print(text)

With HuggingFace `tokenizers` (lower level)

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")

encoding = tokenizer.encode("नमस्ते, आप कैसे हैं?")
print(encoding.tokens)   # ['▁नमस्ते', ',', '▁आप', '▁कैसे', '▁हैं', '?']
print(encoding.ids)      # [892, 15, 423, 1201, 88, 18]

# Perfect round-trip
decoded = tokenizer.decode(encoding.ids)
print(decoded)           # "नमस्ते, आप कैसे हैं?"

Chat format

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")

# Chat template example
messages = [
    {"role": "system", "content": "आप एक सहायक हैं।"},
    {"role": "user",   "content": "भारत की राजधानी क्या है?"},
]

# Look up the frozen IDs of the chat special tokens
system_id = tokenizer.convert_tokens_to_ids("<|system|>")     # 4
user_id   = tokenizer.convert_tokens_to_ids("<|user|>")       # 5
asst_id   = tokenizer.convert_tokens_to_ids("<|assistant|>")  # 6
end_id    = tokenizer.convert_tokens_to_ids("<|end|>")        # 7
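
To turn a `messages` list into a single prompt string, one option is to wrap each turn in its role token and the turn end marker. The sketch below does that. The layout (`<|role|>` + content + `<|end|>`, with a trailing `<|assistant|>` as the generation prompt) is an assumption for illustration; this card does not document an official chat template, and `format_chat` is a hypothetical helper.

```python
from typing import Dict, List

ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"


def format_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Join turns as <|role|>content<|end|> (assumed layout, not a documented template)."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]  # KeyError on an unknown role
        parts.append(f"{role_token}{message['content']}{END_TOKEN}")
    if add_generation_prompt:
        # Leave the assistant turn open so a model can continue from here.
        parts.append(ROLE_TOKENS["assistant"])
    return "".join(parts)


messages = [
    {"role": "system", "content": "आप एक सहायक हैं।"},
    {"role": "user", "content": "भारत की राजधानी क्या है?"},
]
prompt = format_chat(messages)
print(prompt)
```

The resulting string can be passed to the tokenizer like any other text; since `special_token_split_failures` is 0, each marker should encode to its single frozen ID.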

Training Data

Property	Detail
Dataset	AI4Bharat Sangraha
Subset	`verified/hin` — human-curated Hindi, highest quality tier
Corpus size used	5 GB
Sampling seed	42 (reproducible)
Min chars per document	30
Max chars per document	5,000
Min Devanagari ratio	60%

The verified subset of Sangraha was curated by AI4Bharat with human annotation and quality checks, making it one of the highest-quality publicly available Hindi corpora.
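
The document-level filters in the table can be expressed as a small predicate. This is a sketch under stated assumptions: the thresholds are treated as inclusive, length is counted in Unicode code points, and the Devanagari ratio is taken over non-whitespace characters in the block U+0900 to U+097F. The actual ingestion code may define these differently, and the function names are illustrative.

```python
MIN_CHARS = 30
MAX_CHARS = 5_000
MIN_DEVANAGARI_RATIO = 0.60


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= ch <= "\u097f" for ch in chars) / len(chars)


def keep_document(text: str) -> bool:
    """Apply the length and script filters from the Training Data table."""
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        return False
    return devanagari_ratio(text) >= MIN_DEVANAGARI_RATIO


print(keep_document("क" * 40))             # long enough, all Devanagari
print(keep_document("hello world " * 5))   # long enough, but no Devanagari
print(keep_document("क" * 10))             # too short
```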

Algorithm: Unigram Language Model

The Unigram algorithm (Kudo, 2018) starts with a large candidate vocabulary (~300k substrings) and iteratively prunes it to the target size using an EM (Expectation-Maximization) procedure. At inference, the Viterbi algorithm finds the maximum-probability segmentation of input text.

The same algorithm is implemented in SentencePiece and used by models such as T5, ALBERT, and XLNet; LLaMA and Mistral, by contrast, use SentencePiece's BPE mode. Unigram can suit Hindi's morphological richness because it scores all candidate segmentations probabilistically rather than committing to greedy merge decisions.
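
The Viterbi step described above can be illustrated with a small dynamic program. The sketch below is a toy version, not the `tokenizers` implementation: it takes a dictionary of piece log-probabilities and returns the segmentation with the highest total log-probability. The vocabulary and its probabilities are made up for illustration, and Latin text is used only to keep the example easy to read; the function works on any string of code points.

```python
import math
from typing import Dict, List, Tuple


def viterbi_segment(
    text: str, log_probs: Dict[str, float], max_piece_len: int = 24
) -> Tuple[List[str], float]:
    """Return the highest-scoring segmentation of text under a unigram model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    back = [0] * (n + 1)                # start index of the last piece ending at i
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


# Toy vocabulary with made-up probabilities (illustration only).
toy_probs = {"unigram": 0.02, "un": 0.1, "i": 0.05, "gram": 0.1,
             "u": 0.01, "n": 0.01, "g": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}
toy = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("unigram", toy)
print(pieces)  # ['unigram']: 0.02 beats 0.1 * 0.05 * 0.1 for 'un' + 'i' + 'gram'
```

Removing the whole-word piece from the toy vocabulary makes the same call fall back to `['un', 'i', 'gram']`, which is the behaviour that lets a Unigram tokenizer split rare inflected forms into frequent sub-pieces.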

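References:

Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. arXiv:1804.10959.
Kudo, T. and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226.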

Integrity

SHA-256 of tokenizer.json:

fbe21c642a4a13030833be48733c1c6b78244e4c0bc077516422b22e7f046cd9

Verify:

import hashlib
from pathlib import Path

expected = "fbe21c642a4a13030833be48733c1c6b78244e4c0bc077516422b22e7f046cd9"
actual = hashlib.sha256(Path("tokenizer.json").read_bytes()).hexdigest()
assert actual == expected, "tokenizer.json has been modified"

Project Context

This tokenizer is one of three stages in a Hindi SLM project built entirely on a personal laptop:

Component	Status
✅ Data ingestion (5 GB Sangraha Hindi corpus)	Complete
✅ Tokenizer (`hindi_slm_tokenizer_v001`, 32k vocab)	This repo
🔜 SLM pretraining	Coming soon

Goal: A fully offline Hindi language model that runs on small devices — phones, edge hardware, consumer laptops — with no internet dependency. Data stays on the device.

Citation

@misc{hindi-slm-tokenizer-v001,
  author    = {Vaibhav Maurya},
  title     = {Hindi SLM Tokenizer v001 — Unigram 32k},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/vaibhavmaurya/hindi-slm-tokenizer-v001}
}

License

Apache 2.0
