---
language:
- kk
license: apache-2.0
library_name: transformers
tags:
- sentencepiece
- t5
- kazakh
- tokenizer
---

# Kazakh T5 SentencePiece Tokenizer (32K)

SentencePiece unigram tokenizer trained on Kazakh text for T5 pretraining.

## Details

| Property | Value |
|---|---|
| Algorithm | SentencePiece Unigram |
| Base vocab | 32,000 |
| Sentinel tokens | 128 (`<extra_id_0>` ... `<extra_id_127>`) |
| Total vocab | 32,128 |
| Special tokens | `<pad>`=0, `</s>`=1, `<unk>`=2 |
| Character coverage | 99.95% |
| Byte fallback | Yes |
| Normalization | Identity (no NFKC) |

## Training Data

Trained on 5M samples from [stukenov/kazakh-clean-pretrain-text](https://huggingface.co/datasets/stukenov/kazakh-clean-pretrain-text).

## Usage

```python
from transformers import T5Tokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained("stukenov/kazakh-t5-sp-32k")

text = "Kazakh text here"
# encode() returns a list of token ids
tokens = tokenizer.encode(text)
# decode() maps the ids back to a string
print(tokenizer.decode(tokens))
```

## Design Choices

- **Unigram model**: Matches the original T5 tokenizer, which uses a unigram model rather than BPE
- **Identity normalization**: Preserves Kazakh-specific characters without NFKC folding
- **Byte fallback**: Encodes any Unicode character as bytes instead of mapping it to `<unk>`
- **No BOS token**: Follows the T5 convention (`</s>` serves as EOS only)
- **128 sentinels**: Reserved for T5 span corruption pretraining
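
To illustrate what the 128 sentinels are for, here is a minimal sketch of how T5-style span corruption formats one example. It is not the pretraining code used with this tokenizer. It assumes the sentinels follow the standard T5 naming (`<extra_id_0>` through `<extra_id_127>`), and it uses whitespace-split English words as stand-ins for subword pieces. The helper names `sentinel` and `span_corrupt` are hypothetical, and the spans are chosen by hand rather than sampled.

```python
from typing import List, Sequence, Tuple

NUM_SENTINELS = 128  # sentinel count from the Details table


def sentinel(i: int) -> str:
    """Return the i-th sentinel, assuming standard T5 naming (hypothetical helper)."""
    if not 0 <= i < NUM_SENTINELS:
        raise ValueError(f"sentinel index {i} out of range")
    return f"<extra_id_{i}>"


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each half-open [start, end) span with one sentinel.

    Spans must be sorted, non-empty, and non-overlapping.
    Returns (encoder_input, decoder_target) as token lists.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid span {(start, end)}")
        inputs.extend(tokens[cursor:start])  # keep the unmasked tokens
        inputs.append(sentinel(i))           # placeholder for the masked span
        targets.append(sentinel(i))          # the same sentinel introduces the answer
        targets.extend(tokens[start:end])    # the tokens that were masked out
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the tail after the last span
    targets.append(sentinel(len(spans)))     # a final sentinel closes the target
    return inputs, targets


words = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(words, [(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(targets))  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because the target ends with one extra sentinel, this sketch supports at most 127 masked spans per example with a 128-sentinel vocabulary.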