---
language:
  - kk
  - en
  - ru
license: mit
tags:
  - kazakh
  - gec
  - translation
  - nllb
  - seq2seq
  - pretrained
datasets:
  - stukenov/sozkz-corpus-pretrain-gec-mix-v1
base_model: facebook/nllb-200-1.3B
model-index:
- name: sozkz-nllb-1b-kk-pretrain-v1
  results:
  - task:
      type: text-generation
      name: Grammatical Error Correction
    dataset:
      name: sozkz-corpus-gec-benchmark-kk-v1
      type: stukenov/sozkz-corpus-gec-benchmark-kk-v1
      split: test
      config: default
    metrics:
    - name: Exact Match (100-example custom)
      type: exact_match
      value: 1
---
# sozkz-nllb-1b-kk-pretrain-v1

[facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) with continued pretraining on a **mixed GEC + translation dataset** for Kazakh.
Use this as a starting point for further fine-tuning on Kazakh GEC or kk↔en/ru translation tasks.

## Training

| Dataset | Pairs | Epochs | LR | Hardware |
|---------|-------|--------|----|----------|
| [sozkz-corpus-pretrain-gec-mix-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-pretrain-gec-mix-v1) | 1.77M | 2 | 2e-5 | 1× H100 SXM 80GB, bf16 |

The pretrain mix contains:
- Kazakh GEC pairs (noisy → corrected)
- kk↔en and kk↔ru translation pairs

## Evaluation

Evaluated on the canonical **200-example** Kazakh GEC test set
([stukenov/sozkz-corpus-gec-benchmark-kk-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-gec-benchmark-kk-v1), split `test`)
using beam search (`num_beams=5`) with `kaz_Cyrl` as both source and target language.

| Model | Exact Match | CER ↓ | Word Prec | Word Rec | Word F0.5 ↑ | Identity |
|-------|-------------|--------|-----------|----------|-------------|---------|
| **sozkz-fix-mt5-50m-kk-gec-v1** (baseline) | **62.0%** | **0.0802** | **0.494** | **0.661** | **0.520** | **100%** |
| sozkz-nllb-1b-kk-pretrain-v1 (this model) | 43.5% | 0.2643 | 0.206 | 0.543 | 0.235 | 61.5% |
| sozkz-nllb-1b-kk-gec-v1 | 44.0% | 0.2447 | 0.233 | 0.550 | 0.264 | 61.5% |

> Stage-2 fine-tuning on 19K curated GEC examples yields only a marginal improvement over this
> pretrain checkpoint on the standard Word F0.5 metric (0.235 → 0.264). Both 1.3B NLLB models lag
> well behind the 50M mT5 baseline (Word F0.5 0.520) due to language-switching on non-Kazakh tokens.
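
The card does not say how these metrics are computed. As a rough sketch, assume that Word F0.5 is the standard F-beta score with beta = 0.5 over word-level precision and recall, that CER is character edit distance divided by reference length, and that Exact Match compares whitespace-stripped strings. Under those assumptions the table columns could be computed as follows. The function names are illustrative and are not taken from the actual evaluation script.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall; beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def exact_match(hypotheses, references) -> float:
    """Fraction of outputs identical to the reference after stripping whitespace."""
    pairs = list(zip(hypotheses, references))
    return sum(h.strip() == r.strip() for h, r in pairs) / len(pairs)


# Rounded precision/recall from the baseline row of the table above.
print(round(f_beta(0.494, 0.661), 3))
```

With the rounded precision and recall from the baseline row, `f_beta(0.494, 0.661)` comes out at about 0.520, in line with the table. Small differences in other rows are expected because the table values are themselves rounded.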

## Inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-pretrain-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

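def correct(text: str) -> str:
    # Force the first generated token to be the Kazakh language tag
    # (NLLB's way of selecting the target language).
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    # Tokenize, truncate to 256 tokens, and move tensors to the model's device.
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():
        # Beam search with 5 beams, matching the evaluation setup above.
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)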
```

## Fine-tuning from this checkpoint

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./my-kk-gec",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    bf16=True,
    predict_with_generate=True,
    save_only_model=True,
)
# Pass your tokenized dataset and this model to Seq2SeqTrainer
```
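
The snippet above leaves dataset preparation open. As a minimal sketch, assuming your data has a noisy-text column and a corrected-text column (the names `source` and `target` below are hypothetical), the pairs could be tokenized with the tokenizer's `text_target` argument so that `labels` are produced alongside `input_ids`. The helper `make_preprocess` is illustrative and not part of this repository.

```python
def make_preprocess(tokenizer, max_length: int = 256,
                    source_col: str = "source", target_col: str = "target"):
    """Return a batched map function that tokenizes (noisy, corrected) pairs.

    Column names are assumptions; adjust them to your dataset.
    """
    def preprocess(batch):
        # text_target tokenizes the references and stores them under "labels".
        return tokenizer(
            batch[source_col],
            text_target=batch[target_col],
            max_length=max_length,
            truncation=True,
        )
    return preprocess


# Hypothetical wiring, assuming `raw_dataset` is a datasets.Dataset and
# `tokenizer`, `model`, `training_args` are defined as in the snippets above:
#
# tokenized = raw_dataset.map(make_preprocess(tokenizer), batched=True,
#                             remove_columns=raw_dataset.column_names)
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized,
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```

Padding is left to `DataCollatorForSeq2Seq`, which pads each batch dynamically and masks padded label positions so they do not contribute to the loss.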

## Related

- GEC fine-tuned: [stukenov/sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1)
- Recommended baseline: [saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1](https://huggingface.co/saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1)
- Pretrain dataset: [stukenov/sozkz-corpus-pretrain-gec-mix-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-pretrain-gec-mix-v1)
- GEC benchmark: [stukenov/sozkz-corpus-gec-benchmark-kk-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-gec-benchmark-kk-v1)

## Benchmark Results

Evaluated on a **100-example custom GEC test set** (pure model inference, no pre- or post-processing pipeline).

| Category | Score |
|----------|-------|
| Spelling (емле) | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| **Total** | **1/100 (1%)** |
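
The per-category counts above can be reproduced with a simple exact-match tally. The sketch below assumes each test item is a `(category, source, expected)` triple and that an item counts as correct only when the model output equals the expected text after stripping whitespace. For identity-preservation items, the expected text is the unchanged source. The data layout and the `predict` callable are assumptions, not the actual benchmark harness.

```python
from collections import Counter


def score_by_category(examples, predict):
    """Tally exact-match hits per category.

    examples: iterable of (category, source, expected) triples (assumed layout).
    predict:  callable mapping a source string to the model's output string.
    Returns {category: (hits, total)}.
    """
    hits, totals = Counter(), Counter()
    for category, source, expected in examples:
        totals[category] += 1
        if predict(source).strip() == expected.strip():
            hits[category] += 1
    return {category: (hits[category], totals[category]) for category in totals}
```

Passing the `correct` function from the Inference section as `predict` would score this checkpoint; the overall total is the sum of the per-category hits.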


## Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct/15 | Mixed/20 | Ident/15 |
|--------|-------|---------|----------|----------|---------|---------|
| **[sozkz-core-llama-600m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-gec-v1)** | **47%** | 15 | 12 | 3 | 2 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v3) | 38% | 0 | 16 | 9 | 0 | 13/15 |
| [sozkz-core-llama-300m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v4) | 37% | 9 | 6 | 4 | 3 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v1) | 35% | 0 | 12 | 8 | 0 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v2](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v2) | 30% | 0 | 11 | 7 | 0 | 12/15 |
| [sozkz-core-llama-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-gec-v1) | 16% | 2 | 6 | 1 | 0 | 7/15 |
| [sozkz-fix-qwen-500m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4) | 5% | 0 | 1 | 4 | 0 | 0/15 |
| [sozkz-fix-mt5b-kk-gec-run13-v1](https://huggingface.co/stukenov/sozkz-fix-mt5b-kk-gec-run13-v1) | 5% | 0 | 2 | 0 | 0 | 3/15 |
| [sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-nllb-1b-kk-pretrain-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-pretrain-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-core-llama-300m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v3) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |