---
language:
- en
- hi
- bn
- ta
- te
- kn
- ml
- mr
- gu
- pa
- or
- as
library_name: transformers
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- tokenizer
- byte-level-bpe
- brahmic
- indic
- multilingual
- hindi
- bengali
- tamil
- telugu
- kannada
- malayalam
- marathi
- gujarati
- punjabi
- odia
- assamese
---

# BrahmicTokenizer-131K

A 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. It is a drop-in replacement for o200k_base in any training pipeline: same byte-level BPE algorithm, same GPT-2 ByteLevel pre-tokenizer, same decoder, same vocabulary file format.

The tokenizer was presented in the paper [BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base](https://arxiv.org/abs/2605.29379).

## Citation

```bibtex
@misc{shravan2026brahmictokenizer,
  title={BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k\_base},
  author={Rohan Shravan},
  year={2026},
  eprint={2605.29379},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.29379}
}
```

## Headline results

On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB):

- **26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m** at the same 131K vocab budget
- **Per-language savings 15.79% (Tamil) to 76.79% (Odia, a 4.31× compression ratio)**
- Fewer tokens than Tekken/Sarvam-m on **11 of 11** Brahmic languages
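
The savings percentages and the compression ratio quoted above are two views of the same quantity. The following is a minimal sketch of the conversion, assuming the ratio is defined as baseline tokens divided by BrahmicTokenizer-131K tokens; the function names are illustrative and do not come from the repository.

```python
def compression_ratio(savings_pct: float) -> float:
    """Convert 'X% fewer tokens' into a baseline/ours token-count ratio."""
    return 1.0 / (1.0 - savings_pct / 100.0)


def token_savings_pct(baseline_tokens: int, our_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline tokenizer."""
    return 100.0 * (baseline_tokens - our_tokens) / baseline_tokens


# Per-language figures quoted in the list above
print(round(compression_ratio(76.79), 2))  # Odia  -> 4.31
print(round(compression_ratio(15.79), 2))  # Tamil -> 1.19
```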

On non-Indic content (FLORES-200, HumanEval, MBPP, GSM8K):

- **English fertility 1.235 tokens/word** — matches o200k_base (1.232)
- **Best-in-class code/math compression** at the 131K vocab class (0.295 / 0.320 / 0.301 tokens/char on HumanEval / MBPP / GSM8K)
- **Beats Tekken/Sarvam-m by 4.0–14.2%** on HumanEval, MBPP, GSM8K
- EU language fertility within 3% of best (French 1.464, German 1.653, Spanish 1.388)
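
For readers unfamiliar with the metrics, fertility is tokens per word and the code/math figures are tokens per character. The sketch below shows one way to compute both, assuming words are whitespace-delimited (the paper may segment words differently) and using a toy byte-level encoder as a stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Mean tokens per whitespace-delimited word over a corpus."""
    n_tokens = n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    return n_tokens / n_words


def tokens_per_char(texts: Iterable[str], encode: Callable[[str], List[int]]) -> float:
    """Mean tokens per Unicode character over a corpus."""
    n_tokens = n_chars = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_chars += len(text)
    return n_tokens / n_chars


def byte_encode(text: str) -> List[int]:
    """Toy stand-in encoder (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


sample = ["a bc", "def"]
print(fertility(sample, byte_encode))        # 7 tokens over 3 words
print(tokens_per_char(sample, byte_encode))  # 7 tokens over 7 characters
```

With the real tokenizer, `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False)`.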

On FLORES-200 dev+devtest, Brahmic fertility ranks 4th of 11 publicly downloadable tokenizers:

- Mean Brahmic fertility of **2.84** vs 4.87 for Tekken/Sarvam-m, a **41.8% relative improvement at the same vocab budget**
- Beats Tekken/Sarvam-m on every Brahmic language individually (Odia 77%, Assamese 42%, Gujarati 37%, Malayalam 27%, …)

Across the 14-tokenizer benchmark, BrahmicTokenizer-131K is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K vocabulary budget.

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("theschoolofai/BrahmicTokenizer-131K")

# Hindi
print(tokenizer.encode("भारत एक देश है", add_special_tokens=False))
# -> [66526, 2420, 13092, 732]

# Digit grouping (inherited bit-identically from o200k_base)
print(tokenizer.encode("1234567890", add_special_tokens=False))
# -> [4660, 14932, 23133, 26]
# Decoded: ['123', '456', '789', '0']
```

Vocabulary: 131,072 tokens. Specials: 356 added tokens including the standard EOS (`<|end_of_text|>`, ID 36), BOS (`<|begin_of_text|>`, ID 130725), PAD (`<|pad|>`, ID 130726), UNK (`<|unk|>`, ID 130727), FIM, multimodal, and reserved-slot markers.

## Construction

Two-stage surgical retrofit of o200k_base:

1. **Stage 1 — script-prune crop**: removed 38,345 tokens covering 9 non-target scripts (CJK Unified Ideographs, Hangul, Hiragana+Katakana, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala), reducing 200,019 → 131,072 slots and forming `o200k_cropped`.
2. **Stage 2 — surgical Brahmic retrofit**: replaced 2,372 corpus-dead vocabulary slots in `o200k_cropped` with high-frequency Brahmic content, allocated across the 9 Brahmic scripts by linear-programming optimization on a 1.045-billion-token audit corpus.

The pre-tokenizer, decoder, and English/EU/code merge rules are inherited unchanged from o200k_base. The vocabulary content differs in 40,717 of 131,072 slots, but the tokenizer-side interface (algorithm, pre-tokenizer regex, decoder, special-token format, JSON schema) is identical.
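
To illustrate the idea behind the Stage 1 script prune, the sketch below flags decoded token strings that contain characters from the nine non-target scripts. It is not the authors' implementation: it uses Unicode character-name prefixes as a rough proxy for the Unicode Script property, it ignores tokens that are partial UTF-8 byte sequences, and the constant and function names are illustrative.

```python
import unicodedata

# Name prefixes approximating the nine pruned scripts (assumption: a proxy,
# not the exact Unicode Script property; e.g. halfwidth forms are missed).
PRUNED_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH", "HANGUL", "HIRAGANA", "KATAKANA", "ARABIC",
    "CYRILLIC", "THAI", "GREEK", "HEBREW", "SINHALA",
)


def in_pruned_script(token_text: str) -> bool:
    """True if any character's Unicode name starts with a pruned-script prefix."""
    for ch in token_text:
        if unicodedata.name(ch, "").startswith(PRUNED_NAME_PREFIXES):
            return True
    return False


toy_vocab = [" hello", "भारत", "привет", "한국"]
kept = [tok for tok in toy_vocab if not in_pruned_script(tok)]
print(kept)  # the Latin and Devanagari entries survive
```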

## Structural properties

- Every normal token is at most 32 UTF-8 bytes (the longest is a 32-space filler)
- Zero tokens spanning two disjoint writing systems
- Zero cross-script merge rules in the 301,398-entry merge list

Together, the 32-byte cap and the absence of cross-script tokens and merges make BrahmicTokenizer-131K and `o200k_cropped` the **only two of the 14 publicly available tokenizers we benchmarked** that satisfy both constraints simultaneously, which matters for byte-pooled embedding architectures with a fixed per-token byte budget.
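
The byte-cap and single-script constraints described above can be checked per token along the following lines. This is a rough sketch, not the repository's audit script: it labels a letter's script by the first word of its Unicode name, which is only an approximation of the Unicode Script property, and it skips digits, punctuation, and combining marks.

```python
import unicodedata

MAX_TOKEN_BYTES = 32  # per-token byte cap stated in the list above


def scripts_in(text: str) -> set:
    """Approximate script labels for the letters in a decoded token."""
    scripts = set()
    for ch in text:
        # Only letters carry a script label here; marks, digits, and
        # punctuation are ignored.
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def satisfies_constraints(token_text: str) -> bool:
    """True if the token fits the byte cap and uses at most one script."""
    within_budget = len(token_text.encode("utf-8")) <= MAX_TOKEN_BYTES
    single_script = len(scripts_in(token_text)) <= 1
    return within_budget and single_script


print(satisfies_constraints("भारत"))    # one script, 12 bytes
print(satisfies_constraints(" " * 32))  # exactly at the byte cap
print(satisfies_constraints("aभ"))      # mixes Latin and Devanagari
```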

## Files

- `tokenizer.json` — the BPE artifact (8.0 MB, vocab 131,072, merges 301,398, added tokens 356)
- `tokenizer_config.json` — HuggingFace `AutoTokenizer` configuration
- `special_tokens_map.json` — BOS/EOS/PAD/UNK declarations
- `LICENSE` — Apache 2.0

The reproduction scripts (verification, fertility evaluation, 27M-corpus tokenization, 23-test audit) live in the GitHub repository: <https://github.com/theschoolofai/BrahmicTokenizer-131K>.

## License

Apache License 2.0. This work is a derivative of OpenAI's o200k_base tokenizer, released through the MIT-licensed [tiktoken](https://github.com/openai/tiktoken) repository; Apache 2.0 is compatible with incorporating MIT-licensed material. The bundled Brahmic-script fonts referenced in the paper (`NotoSansDevanagari`, `NotoSansBengali`, `NotoSansOriya`, `NotoSansTamil`) are redistributed under the SIL Open Font License 1.1.