Hindi SLM Tokenizer — hindi_slm_tokenizer_v001

A production-ready Hindi subword tokenizer trained on 5 GB of verified Hindi text from the AI4Bharat Sangraha dataset. Built as the first component of an offline Hindi Small Language Model (SLM) designed to run on consumer hardware and small edge devices — no cloud, no internet required.


Model Details

| Property | Value |
|---|---|
| Algorithm | Unigram Language Model (the algorithm behind the T5, ALBERT, and XLNet tokenizers) |
| Vocabulary size | 32,000 |
| Normalizer | NFKC |
| Pre-tokenizer | Metaspace (▁), SentencePiece-compatible |
| Decoder | Metaspace |
| Max piece length | 24 characters |
| Model max length | 2048 tokens |
| Training corpus | 5 GB, AI4Bharat Sangraha verified/hin |
| Library | HuggingFace tokenizers (Rust-backed) |
| Version | hindi_slm_tokenizer_v001 |
| Training date | 2026-05-18 |
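
Because the normalizer is NFKC, text is canonicalized before segmentation, so decoding returns the normalized form of the input rather than the original code points. The sketch below uses Python's standard `unicodedata` module, not the tokenizer itself, and assumes the tokenizer's NFKC step matches the standard Unicode normalization form. The helper name `nfkc` is introduced here for illustration.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (stdlib stand-in for the tokenizer's normalizer)."""
    return unicodedata.normalize("NFKC", text)

# Full-width digits fold to ASCII digits.
fullwidth_digits = "\uff11\uff12\uff13"
print(nfkc(fullwidth_digits))

# The precomposed nukta letter U+0958 becomes base letter U+0915 plus nukta U+093C.
precomposed_qa = "\u0958"
print([hex(ord(ch)) for ch in nfkc(precomposed_qa)])
```

A practical consequence is that a round-trip check should compare the decoded text against `nfkc(original)` rather than against the raw input.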

Special Tokens

These token IDs are permanently frozen — the SLM embedding matrix is indexed by these IDs.

| Token | ID | Purpose |
|---|---|---|
| <pad> | 0 | Padding |
| <unk> | 1 | Unknown |
| <s> | 2 | Beginning of sequence (BOS) |
| </s> | 3 | End of sequence (EOS) |
| <|system|> | 4 | Chat: system turn |
| <|user|> | 5 | Chat: user turn |
| <|assistant|> | 6 | Chat: assistant turn |
| <|end|> | 7 | Chat: turn end marker |
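
Since the embedding matrix is indexed by these IDs, it may be worth checking them whenever the tokenizer is loaded. The following is a minimal sketch: `check_frozen_ids` is a hypothetical helper, and `token_to_id` stands for any callable that maps a token string to its ID or `None` (the `tokenizers` library's `Tokenizer.token_to_id` has that shape). A plain dictionary stands in for the real tokenizer here so the example runs offline.

```python
from typing import Callable, Dict, Optional

# Frozen IDs copied from the table above.
FROZEN_SPECIAL_IDS: Dict[str, int] = {
    "<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3,
    "<|system|>": 4, "<|user|>": 5, "<|assistant|>": 6, "<|end|>": 7,
}

def check_frozen_ids(
    token_to_id: Callable[[str], Optional[int]],
) -> Dict[str, Optional[int]]:
    """Return {token: actual_id} for each special token whose ID differs from the frozen one."""
    mismatches: Dict[str, Optional[int]] = {}
    for token, expected in FROZEN_SPECIAL_IDS.items():
        actual = token_to_id(token)
        if actual != expected:
            mismatches[token] = actual
    return mismatches

# Stand-in lookup for illustration; with the real tokenizer, pass tokenizer.token_to_id.
stub_vocab = dict(FROZEN_SPECIAL_IDS)
print(check_frozen_ids(stub_vocab.get))  # an empty dict means every ID matches
```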

Vocabulary Composition

  • 80% of the vocabulary (25,599 tokens) contains Devanagari characters
  • 20% covers numerals, Latin script (English words, abbreviations), punctuation, and special tokens
  • Average of 4.788 characters per token on the validation set; most Hindi words tokenize to a single token

Validation Results

Evaluated on 20 manually curated Hindi sentences covering conversational text, named entities, mixed Hindi-English (code-switching), morphologically complex words, classical literature references, and numerals.

| Metric | Value | Threshold | Status |
|---|---|---|---|
| unk_rate | 0.000 | < 0.001 | ✅ Pass |
| chars_per_token | 4.788 | > 3.0 | ✅ Pass |
| tokens_per_word | 1.128 | < 2.5 | ✅ Pass |
| roundtrip_success_rate | 1.000 | > 0.99 | ✅ Pass |
| devanagari_char_coverage | 1.000 | > 0.995 | ✅ Pass |
| special_token_split_failures | 0 | == 0 | ✅ Pass |

At 1.128 tokens per word on this 20-sentence set, most Hindi words map to a single token, with little fragmentation. The sample is small, so these figures are a sanity check rather than a benchmark.
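
The card does not spell out how the metrics are defined, so the sketch below assumes simple corpus-level ratios: `unk_rate` as unknown tokens over all tokens, `chars_per_token` as characters (spaces included) over tokens, and `tokens_per_word` as tokens over whitespace-delimited words. `tokenizer_metrics` and `toy_encode` are hypothetical names, and the toy encoder only exists to make the example runnable without the model.

```python
from typing import Callable, Dict, List

def tokenizer_metrics(
    sentences: List[str],
    encode: Callable[[str], List[str]],
    unk_token: str = "<unk>",
) -> Dict[str, float]:
    """Compute corpus-level metrics under the assumed definitions described above."""
    total_chars = total_words = total_tokens = total_unk = 0
    for sentence in sentences:
        tokens = encode(sentence)
        total_chars += len(sentence)          # characters, spaces included
        total_words += len(sentence.split())  # whitespace-delimited words
        total_tokens += len(tokens)
        total_unk += sum(1 for t in tokens if t == unk_token)
    if total_tokens == 0 or total_words == 0:
        raise ValueError("need at least one non-empty sentence")
    return {
        "unk_rate": total_unk / total_tokens,
        "chars_per_token": total_chars / total_tokens,
        "tokens_per_word": total_tokens / total_words,
    }

# Toy encoder for illustration only: one token per whitespace-delimited word.
def toy_encode(text: str) -> List[str]:
    return ["\u2581" + word for word in text.split()]

print(tokenizer_metrics(["आप कैसे हैं", "नमस्ते"], toy_encode))
```

With the real tokenizer loaded through the `tokenizers` library, `encode` could be `lambda s: tokenizer.encode(s).tokens`.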


Variant Comparison

Three vocabulary sizes were trained and compared. 32k was selected for the best balance of coverage and embedding matrix size.

| Variant | Vocab | unk_rate | chars/token | tokens/word | Selected |
|---|---|---|---|---|---|
| hindi_unigram_24k_v001 | 24,000 | 0.000 | 4.660 | 1.159 | |
| hindi_unigram_32k_v001 | 32,000 | 0.000 | 4.788 | 1.128 | ✅ |
| hindi_unigram_48k_v001 | 48,000 | 0.000 | 4.904 | 1.102 | |
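
The embedding-size side of this trade-off is simple arithmetic: the matrix has one row per vocabulary entry, so its size grows linearly with vocabulary size. The sketch below works it through for the three variants. The hidden size of 512 and the 2-byte (fp16) parameter width are hypothetical values chosen for illustration; the card does not state the SLM's dimensions.

```python
def embedding_mib(vocab_size: int, hidden_size: int, bytes_per_param: int = 2) -> float:
    """Size of a vocab_size x hidden_size embedding matrix in MiB."""
    return vocab_size * hidden_size * bytes_per_param / (1024 ** 2)

HIDDEN_SIZE = 512  # hypothetical model width, not taken from this card

for vocab_size in (24_000, 32_000, 48_000):
    size = embedding_mib(vocab_size, HIDDEN_SIZE)
    print(f"{vocab_size:>6} tokens -> {size:.2f} MiB in fp16")
```

Under these assumptions, moving from 32k to 48k grows the embedding matrix by 50% while tokens per word falls only from 1.128 to 1.102, roughly 2% fewer tokens.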

How to Use

With HuggingFace transformers

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")

# Encode
inputs = tokenizer("हिंदी भाषा में प्रशिक्षण", return_tensors="pt")
print(inputs["input_ids"])

# Decode
text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
print(text)

With HuggingFace tokenizers (lower level)

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")

encoding = tokenizer.encode("नमस्ते, आप कैसे हैं?")
print(encoding.tokens)   # ['▁नमस्ते', ',', '▁आप', '▁कैसे', '▁हैं', '?']
print(encoding.ids)      # [892, 15, 423, 1201, 88, 18]

# Perfect round-trip
decoded = tokenizer.decode(encoding.ids)
print(decoded)           # "नमस्ते, आप कैसे हैं?"

Chat format

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vaibhavmaurya/hindi-slm-tokenizer-v001")

# Chat template example
messages = [
    {"role": "system", "content": "आप एक सहायक हैं।"},
    {"role": "user",   "content": "भारत की राजधानी क्या है?"},
]

# Manual formatting using special tokens
system_id = tokenizer.convert_tokens_to_ids("<|system|>")     # 4
user_id   = tokenizer.convert_tokens_to_ids("<|user|>")       # 5
asst_id   = tokenizer.convert_tokens_to_ids("<|assistant|>")  # 6
end_id    = tokenizer.convert_tokens_to_ids("<|end|>")        # 7
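
The card lists the chat tokens but does not specify the exact turn layout. As a sketch only, the helper below assumes each turn is written as the role token, then the content, then `<|end|>`, and that a trailing `<|assistant|>` cues the model to reply. `format_chat` is a hypothetical function, and the layout should be replaced with whatever format the SLM is actually trained on.

```python
from typing import Dict, List

ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"

def format_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Join messages as <role-token>content<|end|> (assumed layout, not confirmed by the card)."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]  # raises KeyError on an unknown role
        parts.append(f"{role_token}{message['content']}{END_TOKEN}")
    if add_generation_prompt:
        parts.append(ROLE_TOKENS["assistant"])     # cue the model to answer
    return "".join(parts)

messages = [
    {"role": "system", "content": "आप एक सहायक हैं।"},
    {"role": "user", "content": "भारत की राजधानी क्या है?"},
]
prompt = format_chat(messages)
print(prompt)
```

The resulting string can then be passed to the tokenizer like any other text; the validation table's `special_token_split_failures` of 0 indicates the markers are kept as single tokens.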

Training Data

| Property | Detail |
|---|---|
| Dataset | AI4Bharat Sangraha |
| Subset | verified/hin: human-curated Hindi, highest quality tier |
| Corpus size used | 5 GB |
| Sampling seed | 42 (reproducible) |
| Min chars per document | 30 |
| Max chars per document | 5,000 |
| Min Devanagari ratio | 60% |

The verified subset of Sangraha was curated by AI4Bharat with human annotation and quality checks, making it one of the highest-quality publicly available Hindi corpora.
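
The document-level filters in the table can be sketched as below. Two details are assumptions, since the card gives only the thresholds: the Devanagari ratio is computed over non-whitespace characters using the U+0900 to U+097F block, and documents outside the length bounds are dropped rather than truncated. Both function names are hypothetical.

```python
def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    devanagari = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return devanagari / len(chars)

def keep_document(
    text: str,
    min_chars: int = 30,
    max_chars: int = 5_000,
    min_ratio: float = 0.60,
) -> bool:
    """Apply the length and script filters with the thresholds from the table."""
    if not (min_chars <= len(text) <= max_chars):
        return False
    return devanagari_ratio(text) >= min_ratio

print(keep_document("क" * 30 + "a" * 10))  # long enough, 75% Devanagari
print(keep_document("क" * 20 + "a" * 20))  # long enough, but only 50% Devanagari
```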


Algorithm: Unigram Language Model

The Unigram algorithm (Kudo, 2018) starts with a large candidate vocabulary (~300k substrings) and iteratively prunes it to the target size using an EM (Expectation-Maximization) procedure. At inference, the Viterbi algorithm finds the maximum-probability segmentation of input text.

Unigram is the algorithm behind the SentencePiece tokenizers of T5, ALBERT, and XLNet; LLaMA and Mistral use SentencePiece in BPE mode instead. Unigram can suit Hindi's rich morphology better than BPE because it scores all candidate segmentations probabilistically rather than applying a fixed sequence of greedy merges.
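
The inference step can be illustrated with a small Viterbi search over a toy vocabulary. This is a sketch of the general technique, not the library's implementation: `viterbi_segment` is a hypothetical function, the log-probabilities are made up, and details such as the Metaspace marker and unknown-token fallback are omitted. The default `max_piece_len` of 24 mirrors the max piece length listed under Model Details.

```python
import math
from typing import Dict, List, Tuple

def viterbi_segment(
    text: str,
    log_probs: Dict[str, float],
    max_piece_len: int = 24,
) -> Tuple[List[str], float]:
    """Return the highest-scoring segmentation of text and its total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best_score[i]: best log-prob of text[:i]
    back = [0] * (n + 1)                # back[i]: start of the last piece on that best path
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]

# Toy vocabulary with made-up log-probabilities (illustration only).
toy_vocab = {"a": -3.0, "b": -3.0, "c": -3.0, "ab": -2.0, "bc": -2.5, "abc": -6.0}
print(viterbi_segment("abc", toy_vocab))
```

For the toy input, the candidates score -9.0 (a, b, c), -5.5 (a, bc), -6.0 (abc), and -5.0 (ab, c), so the search returns `['ab', 'c']` with a score of -5.0.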

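References:

  • Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018.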


Integrity

SHA-256 of tokenizer.json:

fbe21c642a4a13030833be48733c1c6b78244e4c0bc077516422b22e7f046cd9

Verify:

import hashlib
from pathlib import Path

expected = "fbe21c642a4a13030833be48733c1c6b78244e4c0bc077516422b22e7f046cd9"
actual = hashlib.sha256(Path("tokenizer.json").read_bytes()).hexdigest()
assert actual == expected, "tokenizer.json has been modified"

Project Context

This tokenizer is component 1 of 3 in a Hindi SLM built entirely on a personal laptop:

| Component | Status |
|---|---|
| ✅ Data ingestion (5 GB Sangraha Hindi corpus) | Complete |
| ✅ Tokenizer (hindi_slm_tokenizer_v001, 32k vocab) | This repo |
| 🔜 SLM pretraining | Coming soon |

Goal: A fully offline Hindi language model that runs on small devices — phones, edge hardware, consumer laptops — with no internet dependency. Data stays on the device.


Citation

@misc{hindi-slm-tokenizer-v001,
  author    = {Vaibhav Maurya},
  title     = {Hindi SLM Tokenizer v001 — Unigram 32k},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/vaibhavmaurya/hindi-slm-tokenizer-v001}
}

License

Apache 2.0
