GigaChat Nanai Retrained Tokenizer

This repository contains a tokenizer retrained from the tokenizer setup of ai-sage/GigaChat3-10B-A1.8B-base on the dataset zxc0zxc0zxc/nanai-language.

The goal is to explore whether a more Nanai-aware tokenization scheme segments Nanai text better than the original base tokenizer.

Base model

The tokenizer experiments in this repository use ai-sage/GigaChat3-10B-A1.8B-base as the reference base tokenizer/model. The model card describes it as the base/pretrain GigaChat 3 model, with a Mixture-of-Experts architecture and 10B total / 1.8B active parameters. The instruct version is published separately.

Dataset

Experiments were performed using zxc0zxc0zxc/nanai-language, a Hugging Face dataset labeled for Russian and Nanai, with tags related to dictionary data, bilingual lexicon extraction, tokenizer training, and low-resource language work.

What “retrained” means

Here, "retrained" means that the tokenizer vocabulary and segmentation behavior were rebuilt from training data rather than only extended with a few additional tokens.

This makes the repository useful primarily for:

tokenization research
segmentation comparison
vocabulary analysis for Nanai
low-resource tokenizer experiments
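
The difference between retraining and extending described above can be sketched with the Hugging Face transformers API. This is a minimal illustration, not the script used to build this tokenizer: the helper names, the batch size, and the vocab_size argument are assumptions, and the base checkpoint may need extra loading options that are not shown here.

```python
from typing import Iterator, List, Sequence


def batch_iterator(texts: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield the corpus in lists of at most batch_size strings."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


def retrain_tokenizer(base_id: str, texts: Sequence[str], vocab_size: int):
    """Rebuild the vocabulary from texts, reusing the base tokenizer's pipeline."""
    # Imported lazily: loading a tokenizer needs transformers and Hub access.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_id)
    # Available on fast tokenizers only. It keeps the base tokenizer's
    # pipeline and special tokens but learns a new vocabulary from the corpus.
    return base.train_new_from_iterator(batch_iterator(texts), vocab_size=vocab_size)


def extend_tokenizer(base_id: str, new_tokens: List[str]):
    """Keep the original vocabulary and only append new tokens to it."""
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_id)
    base.add_tokens(new_tokens)  # existing tokens keep their ids
    return base
```

With retraining, the whole vocabulary can change, so token ids are not comparable with those of the base tokenizer. With extension, the existing entries keep their ids and only new ones are appended.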

Files

Tokenizer files

special_tokens_map.json
tokenizer.json
tokenizer_config.json

Artifacts

The artifacts/ directory contains evaluation files used during tokenizer comparison:

baseline_tokenizer_metrics.csv
final_tokenizer_metrics.csv
focus_tokenization_examples.csv
tokenization_examples.csv
tokenizer_experiment_summary.json
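
The column names and JSON fields of these files are not documented in this card, so a cautious first step is to list them before any analysis. The sketch below assumes only that the files are regular CSV and JSON; the function name summarize_artifacts is hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def summarize_artifacts(directory: str) -> Dict[str, List[str]]:
    """Map each CSV file to its header row and each JSON file to its top-level keys."""
    summary: Dict[str, List[str]] = {}
    for path in sorted(Path(directory).iterdir()):
        if path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as handle:
                # The first row is taken to be the header; empty files give [].
                summary[path.name] = next(csv.reader(handle), [])
        elif path.suffix == ".json":
            with path.open(encoding="utf-8") as handle:
                data = json.load(handle)
            # Only dictionaries have keys to report; other JSON values give [].
            summary[path.name] = sorted(data) if isinstance(data, dict) else []
    return summary
```

Calling summarize_artifacts("artifacts") on a local copy of the repository returns a dictionary keyed by file name.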

Images

The img/ directory contains before/after comparison plots.

Comparison plots

Average tokenization metrics

Plots for the base and final tokenizers (see img/).

Single-token-rate comparison

Plots for the base and final tokenizers (see img/).
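
This card does not give the exact metric definitions behind the plots. As a rough sketch, the two quantities could be computed as follows, assuming that "average tokenization" means the mean number of tokens per word and that "single-token rate" means the share of words encoded as exactly one token. All function names and the toy tokenizer are hypothetical.

```python
from typing import Callable, List, Sequence

Tokenize = Callable[[str], List[str]]


def token_counts(words: Sequence[str], tokenize: Tokenize) -> List[int]:
    """Number of tokens produced for each word."""
    return [len(tokenize(word)) for word in words]


def average_tokens_per_word(words: Sequence[str], tokenize: Tokenize) -> float:
    """Mean tokens per word (assumed definition); 0.0 for an empty list."""
    counts = token_counts(words, tokenize)
    return sum(counts) / len(counts) if counts else 0.0


def single_token_rate(words: Sequence[str], tokenize: Tokenize) -> float:
    """Share of words encoded as exactly one token (assumed definition)."""
    counts = token_counts(words, tokenize)
    return sum(1 for count in counts if count == 1) / len(counts) if counts else 0.0


def toy_tokenize(word: str, vocab: frozenset = frozenset({"cat"})) -> List[str]:
    """Stand-in tokenizer: known words stay whole, others split into characters."""
    return [word] if word in vocab else list(word)
```

To compare real tokenizers, pass the tokenize method of a loaded tokenizer in place of toy_tokenize and run both functions over the same word list for the base and the retrained tokenizer.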

Intended use

This tokenizer is intended for:

tokenizer research
segmentation benchmarking
Nanai vocabulary coverage analysis
comparison with the original GigaChat tokenizer

Important note

This tokenizer is best treated as an experimental research artifact.

Because it is retrained rather than merely extended, its token IDs generally no longer correspond to the embeddings of the base model. It is therefore mainly useful for studying tokenization behavior and vocabulary quality, not as a drop-in tokenizer replacement in production workflows.

Example

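from transformers import AutoTokenizer

# Load the retrained tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("zxc0zxc0zxc/gigachat-nanai-retrained-tokenizer")

# Split a short sample phrase into tokens and print them.
text = "дёан биа"
print(tokenizer.tokenize(text))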

Acknowledgements

Base tokenizer/model: ai-sage/GigaChat3-10B-A1.8B-base
