GigaChat Nanai Retrained Tokenizer
This repository contains a retrained tokenizer built from the tokenizer setup of
ai-sage/GigaChat3-10B-A1.8B-base using the dataset
zxc0zxc0zxc/nanai-language.
The goal of this tokenizer is to explore whether a vocabulary trained on Nanai data segments Nanai text better than the original base tokenizer does.
Base model
The tokenizer experiments in this repository use
ai-sage/GigaChat3-10B-A1.8B-base as the reference base
tokenizer/model. The model card describes it as the base/pretrain GigaChat 3 model, with a Mixture-of-Experts
architecture and 10B total / 1.8B active parameters. The instruct version is published separately.
Dataset
Experiments were performed using
zxc0zxc0zxc/nanai-language, a Hugging Face dataset
labeled for Russian and Nanai, with tags related to dictionary data, bilingual lexicon extraction, tokenizer
training, and low-resource language work.
What “retrained” means
This repository contains a retrained tokenizer, meaning the tokenizer vocabulary and segmentation behavior were rebuilt from training data rather than only extended with a few additional tokens.
This makes the repository useful primarily for:
- tokenization research
- segmentation comparison
- vocabulary analysis for Nanai
- low-resource tokenizer experiments
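The difference between retraining and extending can be sketched in code. The exact procedure used for this repository is not documented here, so the following is only one common approach, assuming a fast Hugging Face tokenizer that provides `train_new_from_iterator`. The helper names `batched_texts` and `retrain_tokenizer`, the `"text"` column, the `"train"` split, and the vocabulary size in the commented usage are all hypothetical.

```python
from typing import Iterable, Iterator, List


def batched_texts(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield non-empty texts in lists of up to `batch_size` items."""
    batch: List[str] = []
    for text in texts:
        if not text or not text.strip():
            continue  # skip empty or whitespace-only rows
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def retrain_tokenizer(base_tokenizer, texts: Iterable[str], vocab_size: int):
    """Learn a new vocabulary from `texts`, reusing the base tokenizer's pipeline.

    `base_tokenizer` is assumed to be a fast Hugging Face tokenizer; the
    vocabulary is rebuilt from the corpus rather than extended with new tokens.
    """
    return base_tokenizer.train_new_from_iterator(
        batched_texts(texts), vocab_size=vocab_size
    )


# Hypothetical usage (needs network access plus `transformers` and `datasets`):
# from transformers import AutoTokenizer
# from datasets import load_dataset
# base = AutoTokenizer.from_pretrained("ai-sage/GigaChat3-10B-A1.8B-base")
# rows = load_dataset("zxc0zxc0zxc/nanai-language", split="train")  # split name assumed
# new_tok = retrain_tokenizer(base, (r["text"] for r in rows), vocab_size=32000)  # column and size assumed
# new_tok.save_pretrained("retrained-tokenizer")
```

By contrast, extending a tokenizer would keep the existing vocabulary and only append entries, for example through `add_tokens`.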
Files
Tokenizer files
- special_tokens_map.json
- tokenizer.json
- tokenizer_config.json
Artifacts
The artifacts/ directory contains evaluation files used during tokenizer comparison:
- baseline_tokenizer_metrics.csv
- final_tokenizer_metrics.csv
- focus_tokenization_examples.csv
- tokenization_examples.csv
- tokenizer_experiment_summary.json
Images
The img/ directory contains before/after comparison plots.
Comparison plots
Average tokenization metrics
Single-token-rate comparison
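The plot titles above point to two kinds of measurement: an average over tokenization counts and a single-token rate. The exact definitions used for the artifacts are not stated here, so the sketch below assumes the usual ones: mean tokens per word, and the share of words encoded as exactly one token. The function name `tokenization_metrics` and the dictionary keys are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def tokenization_metrics(
    words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Compute simple per-word tokenization statistics.

    `tokenize` is any callable mapping a string to a list of tokens,
    for example the `tokenize` method of a Hugging Face tokenizer.
    """
    counts = [len(tokenize(word)) for word in words]
    if not counts:
        raise ValueError("at least one word is required")
    total = len(counts)
    return {
        # Assumed definition: mean number of tokens produced per word.
        "avg_tokens_per_word": sum(counts) / total,
        # Assumed definition: fraction of words kept as a single token.
        "single_token_rate": sum(1 for c in counts if c == 1) / total,
    }


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer for illustration: split into 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


if __name__ == "__main__":
    print(tokenization_metrics(["ab", "abcd", "abcdefg"], toy_tokenize))
```

Running the same function with the base tokenizer's `tokenize` method and then with the retrained one, over the same word list, would give a before/after comparison in the spirit of these plots.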
Intended use
This tokenizer is intended for:
- tokenizer research
- segmentation benchmarking
- Nanai vocabulary coverage analysis
- comparison with the original GigaChat tokenizer
Important note
This tokenizer is best treated as an experimental research artifact.
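Because it is retrained rather than merely extended, its vocabulary generally assigns different token IDs than the original, so the base model's embeddings would not line up with it without further training. It is therefore mainly useful for studying tokenization behavior and vocabulary quality, not as a drop-in tokenizer replacement in production workflows.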
Example
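from transformers import AutoTokenizer
# Load the retrained tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("zxc0zxc0zxc/gigachat-nanai-retrained-tokenizer")
# Split a short sample text into tokens and print them.
text = "дёан биа"
print(tokenizer.tokenize(text))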
Acknowledgements
Base tokenizer/model: ai-sage/GigaChat3-10B-A1.8B-base


