GigaChat Nanai Retrained Tokenizer

This repository contains a tokenizer retrained from the tokenizer configuration of ai-sage/GigaChat3-10B-A1.8B-base on the zxc0zxc0zxc/nanai-language dataset.

The goal of this tokenizer is to explore whether a more Nanai-aware tokenization scheme can improve segmentation quality compared to the original base tokenizer.

Base model

The tokenizer experiments in this repository use ai-sage/GigaChat3-10B-A1.8B-base as the reference base tokenizer/model. The model card describes it as the base/pretrain GigaChat 3 model, with a Mixture-of-Experts architecture and 10B total / 1.8B active parameters. The instruct version is published separately.

Dataset

Experiments were performed using zxc0zxc0zxc/nanai-language, a Hugging Face dataset labeled for Russian and Nanai, with tags related to dictionary data, bilingual lexicon extraction, tokenizer training, and low-resource language work.

What “retrained” means

Here, "retrained" means that the tokenizer's vocabulary and segmentation behavior were rebuilt from the training data, rather than the base vocabulary being extended with a few additional tokens.

This makes the repository useful primarily for:

  • tokenization research
  • segmentation comparison
  • vocabulary analysis for Nanai
  • low-resource tokenizer experiments
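The difference between retraining and extending can be sketched with the Hugging Face transformers API. This is a minimal illustration, not the procedure used to build this repository: the helper names (batch_iterator, retrain_tokenizer, extend_tokenizer), the batch size, the vocabulary size, and the corpus being passed as a list of strings are all assumptions. It also assumes the base tokenizer loads as a "fast" tokenizer, which train_new_from_iterator requires.

```python
from typing import Iterable, Iterator, List


def batch_iterator(texts: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield the corpus in batches of strings (hypothetical helper)."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def retrain_tokenizer(base_id: str, texts: List[str], vocab_size: int):
    """Retrain: learn a new vocabulary from `texts`, reusing the base
    tokenizer's pipeline (normalizer, pre-tokenizer, special tokens)."""
    # Imported lazily so the pure-Python helper above works without transformers.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_id)
    # vocab_size is an assumed parameter; the value used here is not documented.
    return base.train_new_from_iterator(batch_iterator(texts), vocab_size=vocab_size)


def extend_tokenizer(base_id: str, new_tokens: Iterable[str]):
    """Extend: keep the base vocabulary and append extra tokens to it."""
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_id)
    base.add_tokens(list(new_tokens))
    return base
```

In the retrained case the learned vocabulary comes from the corpus; in the extended case the existing vocabulary stays in place and only new entries are appended.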

Files

Tokenizer files

  • special_tokens_map.json
  • tokenizer.json
  • tokenizer_config.json

Artifacts

The artifacts/ directory contains evaluation files used during tokenizer comparison:

  • baseline_tokenizer_metrics.csv
  • final_tokenizer_metrics.csv
  • focus_tokenization_examples.csv
  • tokenization_examples.csv
  • tokenizer_experiment_summary.json

Images

The img/ directory contains before/after comparison plots.

Comparison plots

Average tokenization metrics

[Plots: base tokenizer vs. final tokenizer; see img/]

Single-token-rate comparison

[Plots: base tokenizer vs. final tokenizer; see img/]
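The two metrics named above can be computed for any tokenizer roughly as follows. The exact definitions behind the files in artifacts/ are not stated here, so this sketch assumes that the average metric is the mean number of tokens per word and that the single-token rate is the share of words encoded as exactly one token. The function name and the two toy tokenizers are hypothetical stand-ins.

```python
from typing import Callable, Dict, Iterable, List


def tokenization_metrics(
    words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[str, float]:
    """Return mean tokens per word and the fraction of single-token words."""
    counts = [len(tokenize(word)) for word in words]
    if not counts:
        # Avoid division by zero on an empty word list.
        return {"avg_tokens_per_word": 0.0, "single_token_rate": 0.0}
    return {
        "avg_tokens_per_word": sum(counts) / len(counts),
        "single_token_rate": sum(1 for c in counts if c == 1) / len(counts),
    }


# Toy tokenizers for illustration only (assumptions, not the real ones).
def char_tokenize(word: str) -> List[str]:
    """Worst case: every character becomes its own token."""
    return list(word)


def whole_word_tokenize(word: str) -> List[str]:
    """Best case: every word is a single token."""
    return [word]


sample_words = ["дёан", "биа"]
print(tokenization_metrics(sample_words, char_tokenize))
print(tokenization_metrics(sample_words, whole_word_tokenize))
```

With a real tokenizer loaded through transformers, its tokenize method can be passed as the tokenize argument.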

Intended use

This tokenizer is intended for:

  • tokenizer research
  • segmentation benchmarking
  • Nanai vocabulary coverage analysis
  • comparison with the original GigaChat tokenizer

Important note

This tokenizer is best treated as an experimental research artifact.

Because it is retrained rather than merely extended, it is mainly useful for studying tokenization behavior and vocabulary quality rather than as a drop-in tokenizer replacement in production workflows.
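One reason a retrained tokenizer cannot simply be swapped in is that a model's embedding table is indexed by token ID, and retraining reassigns those IDs. The sketch below uses two tiny made-up vocabularies (not the real GigaChat or retrained vocabularies) to show a token that exists in both but maps to a different ID in each; the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

# Made-up vocabularies for illustration only.
base_vocab: Dict[str, int] = {"<unk>": 0, "би": 1, "а": 2, "дё": 3, "ан": 4}
retrained_vocab: Dict[str, int] = {"<unk>": 0, "дёан": 1, "биа": 2, "а": 3}


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to IDs, falling back to the <unk> ID for unknown tokens."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


def shared_tokens_with_changed_ids(
    old: Dict[str, int], new: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Tokens present in both vocabularies whose ID differs: (old_id, new_id)."""
    return {
        token: (old[token], new[token])
        for token in old.keys() & new.keys()
        if old[token] != new[token]
    }


changed = shared_tokens_with_changed_ids(base_vocab, retrained_vocab)
print(changed)
```

A model trained with the base vocabulary would look up the wrong embedding row for any token listed in changed, and would have no trained row at all for tokens that exist only in the retrained vocabulary.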

Example

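from transformers import AutoTokenizer

# Load the retrained tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("zxc0zxc0zxc/gigachat-nanai-retrained-tokenizer")

# Split a short sample string into tokens and print them.
text = "дёан биа"
print(tokenizer.tokenize(text))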

Acknowledgements

Base tokenizer/model: ai-sage/GigaChat3-10B-A1.8B-base
