---
language:
- en
- hi
- bn
- ta
- te
- kn
- ml
- mr
- gu
- pa
- or
- as
library_name: transformers
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- tokenizer
- byte-level-bpe
- brahmic
- indic
- multilingual
- hindi
- bengali
- tamil
- telugu
- kannada
- malayalam
- marathi
- gujarati
- punjabi
- odia
- assamese
---

# BrahmicTokenizer-131K

A 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. It is a drop-in replacement for any o200k_base training pipeline: same byte-level BPE algorithm, same GPT-2 ByteLevel pre-tokenizer, same decoder, same vocabulary file format.

The tokenizer was presented in the paper [BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base](https://arxiv.org/abs/2605.29379).

## Citation

```
@misc{shravan2026brahmictokenizer,
  title={BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k\_base},
  author={Rohan Shravan},
  year={2026},
  eprint={2605.29379},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.29379}
}
```

## Headline results

On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB):

- **26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m** at the same 131K vocab budget
- **Per-language savings range from 15.79% (Tamil) to 76.79% (Odia, a 4.31× compression ratio)**
- The saving holds on **11 of 11** Brahmic languages, with no exceptions

On non-Indic content (FLORES-200, HumanEval, MBPP, GSM8K):

- **English fertility 1.235 tokens/word**, on par with o200k_base (1.232)
- **Best-in-class code/math compression** at the 131K vocab class (0.295 / 0.320 / 0.301 tokens/char on HumanEval / MBPP / GSM8K)
- **Beats Tekken/Sarvam-m by 4.0–14.2%** on HumanEval, MBPP, GSM8K
- EU language fertility within 3% of best (French 1.464, German 1.653, Spanish 1.388)

On FLORES-200 dev+devtest Brahmic fertility (rank 4 of 11 publicly downloadable tokenizers):

- BrahmicTokenizer-131K reaches **2.84** mean Brahmic fertility vs Tekken/Sarvam-m's 4.87, a **41.8% relative improvement at the same vocab budget**
- Beats Tekken/Sarvam-m on every Brahmic language individually (Or 77%, As 42%, Gu 37%, Ml 27%, …)

Across the 14-tokenizer benchmark, BrahmicTokenizer-131K is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K vocabulary budget.

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("theschoolofai/BrahmicTokenizer-131K")

# Hindi
print(tokenizer.encode("भारत एक देश है", add_special_tokens=False))
# -> [66526, 2420, 13092, 732]

# Digit grouping (inherited bit-identically from o200k_base)
print(tokenizer.encode("1234567890", add_special_tokens=False))
# -> [4660, 14932, 23133, 26]
# Decoded: ['123', '456', '789', '0']
```

Vocabulary: 131,072 tokens. Specials: 356 added tokens including the standard EOS (`<|end_of_text|>`, ID 36), BOS (`<|begin_of_text|>`, ID 130725), PAD (`<|pad|>`, ID 130726), UNK (`<|unk|>`, ID 130727), FIM, multimodal, and reserved-slot markers.

## Construction

Two-stage surgical retrofit of o200k_base:

1. **Stage 1: script-prune crop.** Removed 38,345 tokens covering 9 non-target scripts (CJK Unified Ideographs, Hangul, Hiragana+Katakana, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala), reducing 200,019 → 131,072 slots and forming `o200k_cropped`.
2. **Stage 2: surgical Brahmic retrofit.** Replaced 2,372 corpus-dead vocabulary slots in `o200k_cropped` with high-frequency Brahmic content, allocated across the 9 Brahmic scripts by linear-programming optimization on a 1.045-billion-token audit corpus.

The pre-tokenizer, decoder, and English/EU/code merge rules are inherited unchanged from o200k_base. The vocabulary content differs in 40,717 of 131,072 slots, but the tokenizer-side interface (algorithm, pre-tokenizer regex, decoder, special-token format, JSON schema) is identical.

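To make the Stage 1 idea concrete, here is a minimal sketch of a script-based crop on a toy vocabulary. It is not the procedure used to build `o200k_cropped`: the `NON_TARGET_PREFIXES` list, the helper names, the use of Unicode character names as a proxy for script, and the dense renumbering of surviving tokens are all assumptions made for illustration. It also works on plain decoded strings, whereas a real byte-level vocabulary would first need its keys mapped back to bytes.

```python
import unicodedata

# Hypothetical prefixes of Unicode character names for the nine pruned scripts.
# Assumption: the character name is a good-enough proxy for the script.
NON_TARGET_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH", "HANGUL", "HIRAGANA", "KATAKANA",
    "ARABIC", "CYRILLIC", "THAI", "GREEK", "HEBREW", "SINHALA",
)


def uses_non_target_script(token: str) -> bool:
    """Return True if any character's Unicode name starts with a pruned-script prefix."""
    return any(
        unicodedata.name(ch, "").startswith(NON_TARGET_PREFIXES) for ch in token
    )


def crop_vocab(vocab: dict[str, int]) -> dict[str, int]:
    """Drop non-target-script tokens and renumber survivors in their original order.

    Dense renumbering is an assumption of this sketch, not a documented detail.
    """
    kept = sorted(
        (tok for tok in vocab if not uses_non_target_script(tok)), key=vocab.get
    )
    return {tok: new_id for new_id, tok in enumerate(kept)}


# Toy vocabulary: Latin, Cyrillic, Devanagari, CJK, Latin with a space, Hangul.
toy_vocab = {"the": 0, "привет": 1, "भारत": 2, "日本": 3, " code": 4, "한국": 5}
cropped = crop_vocab(toy_vocab)
print(cropped)  # -> {'the': 0, 'भारत': 1, ' code': 2}
```

In this toy run the Cyrillic, CJK, and Hangul entries are removed, while the Latin and Devanagari entries survive and are renumbered.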
## Structural properties

- Every normal token is at most 32 UTF-8 bytes (the longest is a 32-space filler)
- Zero tokens spanning two disjoint writing systems
- Zero cross-script merge rules in the 301,398-entry merge list

These properties make BrahmicTokenizer-131K and `o200k_cropped` the **only two of the 14 publicly available tokenizers we benchmarked** that satisfy both the byte-length and single-script constraints, which matters for byte-pooled embedding architectures with a fixed per-token-byte budget.

## Files

- `tokenizer.json`: the BPE artifact (8.0 MB, vocab 131,072, merges 301,398, added tokens 356)
- `tokenizer_config.json`: HuggingFace `AutoTokenizer` configuration
- `special_tokens_map.json`: BOS/EOS/PAD/UNK declarations
- `LICENSE`: Apache 2.0

The reproduction scripts (verification, fertility evaluation, 27M-corpus tokenization, 23-test audit) live in the GitHub repository: .

## License

Apache License 2.0. This work is a derivative of OpenAI's o200k_base tokenizer, released through the MIT-licensed [tiktoken](https://github.com/openai/tiktoken) repository; Apache 2.0 is compatible with incorporating MIT-licensed material. The bundled Brahmic-script fonts referenced in the paper (`NotoSansDevanagari`, `NotoSansBengali`, `NotoSansOriya`, `NotoSansTamil`) are redistributed under the SIL Open Font License 1.1.
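
## Checking the byte cap (sketch)

The 32-byte cap listed under Structural properties can be spot-checked on any ByteLevel BPE vocabulary. The sketch below is not one of the repository's audit scripts. It rebuilds the well-known GPT-2 byte-to-character table, inverts it, and measures each token's true byte length; the helper names (`token_bytes`, `oversized_tokens`, `to_byte_level`) and the toy vocabulary are hypothetical.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2's byte -> printable-character table used by ByteLevel BPE vocabularies."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:  # remaining bytes are mapped to code points 256, 257, ...
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_UNICODE = bytes_to_unicode()
UNICODE_TO_BYTE = {ch: b for b, ch in BYTE_TO_UNICODE.items()}


def to_byte_level(text: str) -> str:
    """Encode plain text the way a ByteLevel vocabulary stores its keys."""
    return "".join(BYTE_TO_UNICODE[b] for b in text.encode("utf-8"))


def token_bytes(token: str) -> bytes:
    """Map a ByteLevel vocabulary key back to the raw bytes it stands for."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def oversized_tokens(vocab, limit: int = 32) -> list[str]:
    """Return every token whose raw byte length exceeds `limit`."""
    return [tok for tok in vocab if len(token_bytes(tok)) > limit]


# Toy vocabulary: a 32-space filler, a Devanagari word, and a 33-space violator.
toy_vocab = [to_byte_level(" " * 32), to_byte_level("भारत"), to_byte_level(" " * 33)]
oversized = oversized_tokens(toy_vocab)
print(len(oversized))  # -> 1
```

To run the same check against the real tokenizer, one option is to pass the keys of `tokenizer.get_vocab()`, excluding the added special tokens, whose strings are stored literally rather than in ByteLevel form.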