---
language:
- kk
license: apache-2.0
library_name: transformers
tags:
- sentencepiece
- t5
- kazakh
- tokenizer
---

# Kazakh T5 SentencePiece Tokenizer (32K)

A SentencePiece Unigram tokenizer trained on Kazakh text, intended for T5 pretraining.

## Details

| Property | Value |
|---|---|
| Algorithm | SentencePiece Unigram |
| Base vocab | 32,000 |
| Sentinel tokens | 128 (`<extra_id_0>` ... `<extra_id_127>`) |
| Total vocab | 32,128 |
| Special tokens | `<pad>`=0, `</s>`=1, `<unk>`=2 |
| Character coverage | 99.95% |
| Byte fallback | Yes |
| Normalization | Identity (no NFKC) |
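
The "Byte fallback" row means that a character missing from the vocabulary is emitted as its UTF-8 bytes rather than as `<unk>`. The following is a minimal sketch of that idea in plain Python, not the tokenizer's actual segmentation: the helper names and the toy alphabet are hypothetical, and single characters stand in for real subword pieces. The `<0xNN>` piece format follows SentencePiece's byte-piece naming.

```python
def byte_fallback_pieces(text: str, known_chars: set) -> list:
    """Illustrative only: map out-of-vocabulary characters to byte pieces.

    A real Unigram model segments text into multi-character subwords; here each
    known character stands in for a vocabulary piece so the fallback is visible.
    """
    pieces = []
    for ch in text:
        if ch in known_chars:
            pieces.append(ch)
        else:
            # One piece per UTF-8 byte, written as <0xNN> with uppercase hex.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def pieces_to_text(pieces: list) -> str:
    """Invert byte_fallback_pieces by reassembling the UTF-8 byte stream."""
    raw = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            raw.append(int(p[3:5], 16))
        else:
            raw.extend(p.encode("utf-8"))
    return raw.decode("utf-8")


# Hypothetical toy alphabet: covers the Kazakh word below but not the emoji.
known = set("сәлем ")
pieces = byte_fallback_pieces("сәлем 🙂", known)
print(pieces)
print(pieces_to_text(pieces))
```

Because every string has a UTF-8 encoding, the round trip is lossless even for characters the training data never contained.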

## Training Data

Trained on 5M samples from [stukenov/kazakh-clean-pretrain-text](https://huggingface.co/datasets/stukenov/kazakh-clean-pretrain-text).
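
The exact training command is not part of this card. As a rough sketch, the settings listed under Details could map onto SentencePiece trainer options as shown below. The corpus path and model prefix are hypothetical, the 128 sentinel tokens are assumed to be added outside this step, and `train` needs the `sentencepiece` package installed.

```python
def trainer_options(corpus_path: str, model_prefix: str) -> dict:
    """Map the settings from the Details table onto SentencePiece trainer flags."""
    return {
        "input": corpus_path,          # hypothetical: plain text, one sample per line
        "model_prefix": model_prefix,  # hypothetical output name
        "model_type": "unigram",
        "vocab_size": 32000,           # base vocab only, sentinels not included
        "character_coverage": 0.9995,
        "byte_fallback": True,
        "normalization_rule_name": "identity",
        "pad_id": 0,
        "eos_id": 1,
        "unk_id": 2,
        "bos_id": -1,                  # -1 disables the BOS piece
    }


def train(corpus_path: str, model_prefix: str) -> None:
    # Imported lazily so the options can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**trainer_options(corpus_path, model_prefix))


opts = trainer_options("corpus.txt", "kazakh_t5_sp_32k")
print(opts)
```

Calling `train("corpus.txt", "kazakh_t5_sp_32k")` would write `kazakh_t5_sp_32k.model` and `kazakh_t5_sp_32k.vocab` next to the script.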

## Usage

```python
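from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("stukenov/kazakh-t5-sp-32k")

text = "Сәлем, әлем!"  # "Hello, world!" in Kazakh; replace with your own text

# encode() appends the EOS token </s> and returns a list of token IDs
tokens = tokenizer.encode(text)
print(tokens)

# skip_special_tokens=True drops </s> so the original text is printed back
print(tokenizer.decode(tokens, skip_special_tokens=True))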
```

## Design Choices

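- **Unigram model**: Matches the original T5 tokenizer, which uses a Unigram model rather than BPE
- **Identity normalization**: Text is passed through unchanged (no NFKC), so Kazakh-specific characters are preserved exactly as written
- **Byte fallback**: Characters outside the vocabulary are split into UTF-8 byte pieces instead of mapping to `<unk>`
- **No BOS token**: Follows the T5 convention, where `</s>` serves as EOS and no BOS is used
- **128 sentinels**: `<extra_id_0>` ... `<extra_id_127>` mark masked spans during T5 span-corruption pretraining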
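
To illustrate what the sentinel tokens are for, here is a minimal sketch of T5-style span corruption in plain Python. It is not the pretraining pipeline used with this tokenizer: the function name is hypothetical, the spans are chosen by hand rather than sampled, and whole words stand in for subword pieces.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and collect the spans as targets.

    `spans` must be sorted, non-overlapping, half-open index pairs.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # the span itself is hidden
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the model must reproduce it
        cursor = end
    inputs.extend(tokens[cursor:])
    # The target ends with one more sentinel, as in the original T5 setup.
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets


# Hypothetical example sentence, split on whitespace for readability.
tokens = ["бүгін", "ауа", "райы", "өте", "жақсы", "болды"]
inputs, targets = span_corrupt(tokens, [(1, 3), (4, 5)])
print(" ".join(inputs))
print(" ".join(targets))
```

Each masked span consumes one sentinel and the target needs one more to close, so with 128 sentinels a single example can hold at most 127 masked spans.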