---
language:
- kk
- en
- ru
license: mit
tags:
- kazakh
- gec
- translation
- nllb
- seq2seq
- pretrained
datasets:
- stukenov/sozkz-corpus-pretrain-gec-mix-v1
base_model: facebook/nllb-200-1.3B
model-index:
- name: sozkz-nllb-1b-kk-pretrain-v1
  results:
  - task:
      type: text-generation
      name: Grammatical Error Correction
    dataset:
      name: sozkz-corpus-gec-benchmark-kk-v1
      type: stukenov/sozkz-corpus-gec-benchmark-kk-v1
      split: test
      config: default
    metrics:
    - name: Exact Match (100-example custom)
      type: exact_match
      value: 1
---

# sozkz-nllb-1b-kk-pretrain-v1

NLLB-200-1.3B pretrained on a **mixed GEC + translation dataset** for Kazakh. Use this as a starting point for further fine-tuning on Kazakh GEC or kk↔en/ru translation tasks.

## Training

| Dataset | Pairs | Epochs | LR | Hardware |
|---------|-------|--------|----|----------|
| [sozkz-corpus-pretrain-gec-mix-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-pretrain-gec-mix-v1) | 1.77M | 2 | 2e-5 | 1× H100 SXM 80GB, bf16 |

The pretrain mix contains:
- Kazakh GEC pairs (noisy → corrected)
- kk↔en and kk↔ru translation pairs

## Evaluation

Evaluated on the canonical **200-example** Kazakh GEC test set ([stukenov/sozkz-corpus-gec-benchmark-kk-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-gec-benchmark-kk-v1), split `test`) using `beam=5`, `kaz_Cyrl → kaz_Cyrl`.

| Model | Exact Match | CER ↓ | Word Prec | Word Rec | Word F0.5 ↑ | Identity |
|-------|-------------|--------|-----------|----------|-------------|---------|
| **sozkz-fix-mt5-50m-kk-gec-v1** (baseline) | **62.0%** | **0.0802** | **0.494** | **0.661** | **0.520** | **100%** |
| sozkz-nllb-1b-kk-pretrain-v1 (this model) | 43.5% | 0.2643 | 0.206 | 0.543 | 0.235 | 61.5% |
| sozkz-nllb-1b-kk-gec-v1 | 44.0% | 0.2447 | 0.233 | 0.550 | 0.264 | 61.5% |

> Fine-tuning on 19K curated GEC examples (stage 2) yields only a marginal improvement over this pretrained checkpoint
> on the standard Word F0.5 metric (0.264 vs. 0.235).
> The 1.3B NLLB model lags behind the 50M mT5 baseline
> due to language-switching on non-Kazakh tokens.

## Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-pretrain-v1"

# Source and target are both Kazakh (Cyrillic): GEC is run as kaz_Cyrl -> kaz_Cyrl.
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

def correct(text: str) -> str:
    # Force the decoder to start with the Kazakh language token.
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True).to(model.device)
    with torch.no_grad():
        # Beam search with 5 beams, matching the evaluation setup.
        out = model.generate(**inputs, forced_bos_token_id=forced_bos, num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

## Fine-tuning from this checkpoint

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./my-kk-gec",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch size of 64 per device
    learning_rate=5e-6,
    bf16=True,
    predict_with_generate=True,     # evaluate with generate() rather than teacher forcing
    save_only_model=True,           # skip optimizer/scheduler state in checkpoints
)
# Pass your tokenized dataset and this model to Seq2SeqTrainer
```

## Related

- GEC fine-tuned: [stukenov/sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1)
- Recommended baseline: [saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1](https://huggingface.co/saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1)
- Pretrain dataset: [stukenov/sozkz-corpus-pretrain-gec-mix-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-pretrain-gec-mix-v1)
- GEC benchmark: [stukenov/sozkz-corpus-gec-benchmark-kk-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-gec-benchmark-kk-v1)

## Benchmark Results

Evaluated on a **100-example custom GEC test set** (pure model inference, no pre/post pipeline).

| Category | Score |
|----------|-------|
| Spelling | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| **Total** | **1/100 (1%)** |

## Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punctuation/15 | Mixed/20 | Identity/15 |
|-------|-------|-------------|------------|----------------|----------|-------------|
| **[sozkz-core-llama-600m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-gec-v1)** | **47%** | 15 | 12 | 3 | 2 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v3) | 38% | 0 | 16 | 9 | 0 | 13/15 |
| [sozkz-core-llama-300m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v4) | 37% | 9 | 6 | 4 | 3 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v1) | 35% | 0 | 12 | 8 | 0 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v2](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v2) | 30% | 0 | 11 | 7 | 0 | 12/15 |
| [sozkz-core-llama-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-gec-v1) | 16% | 2 | 6 | 1 | 0 | 7/15 |
| [sozkz-fix-qwen-500m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4) | 5% | 0 | 1 | 4 | 0 | 0/15 |
| [sozkz-fix-mt5b-kk-gec-run13-v1](https://huggingface.co/stukenov/sozkz-fix-mt5b-kk-gec-run13-v1) | 5% | 0 | 2 | 0 | 0 | 3/15 |
| [sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-nllb-1b-kk-pretrain-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-pretrain-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-core-llama-300m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v3) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
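
## Metric sketch

The Exact Match, CER, and Word F0.5 figures reported in this card can be approximated with a few lines of plain Python. This is a minimal sketch under assumptions, not the evaluation script used for the tables above: CER is assumed to be character-level Levenshtein distance divided by total reference length, exact match is assumed to compare whitespace-stripped strings, and F0.5 is the standard F-beta score with beta = 0.5. The function names (`levenshtein`, `cer`, `exact_match`, `f_beta`) are illustrative.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Assumed definition: total character edits / total reference characters."""
    edits = sum(levenshtein(h, r) for h, r in zip(hypotheses, references))
    chars = sum(len(r) for r in references)
    return edits / chars if chars else 0.0


def exact_match(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of outputs identical to the reference after stripping whitespace."""
    if not references:
        return 0.0
    hits = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return hits / len(references)


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta; beta = 0.5 weights precision more heavily than recall."""
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0


if __name__ == "__main__":
    # Word precision/recall of this model from the Evaluation table.
    print(round(f_beta(0.206, 0.543), 3))  # 0.235
```

Plugging the reported word precision and recall for this model (0.206 and 0.543) into `f_beta` gives 0.235, which agrees with the Word F0.5 column. Small last-digit differences on other rows are expected because the table's precision and recall are already rounded.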