sozkz-nllb-1b-kk-pretrain-v1
NLLB-200-1.3B (facebook/nllb-200-1.3B) pretrained on a mixed GEC + translation dataset for Kazakh. Use it as a starting point for further fine-tuning on Kazakh GEC or kk↔en/ru translation tasks.
Training
| Dataset | Pairs | Epochs | LR | Hardware |
|---|---|---|---|---|
| sozkz-corpus-pretrain-gec-mix-v1 | 1.77M | 2 | 2e-5 | 1× H100 SXM 80GB, bf16 |
The pretrain mix contains:
- Kazakh GEC pairs (noisy → corrected)
- kk↔en and kk↔ru translation pairs
Evaluation
Evaluated on the canonical 200-example Kazakh GEC test set (stukenov/sozkz-corpus-gec-benchmark-kk-v1, split test) using beam search with 5 beams and kaz_Cyrl as both source and target language.
| Model | Exact Match | CER ↓ | Word Prec | Word Rec | Word F0.5 ↑ | Identity |
|---|---|---|---|---|---|---|
| sozkz-fix-mt5-50m-kk-gec-v1 (baseline) | 62.0% | 0.0802 | 0.494 | 0.661 | 0.520 | 100% |
| sozkz-nllb-1b-kk-pretrain-v1 (this model) | 43.5% | 0.2643 | 0.206 | 0.543 | 0.235 | 61.5% |
| sozkz-nllb-1b-kk-gec-v1 | 44.0% | 0.2447 | 0.233 | 0.550 | 0.264 | 61.5% |
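Stage-2 fine-tuning on 19K curated GEC examples (sozkz-nllb-1b-kk-gec-v1) yields only a marginal improvement over this pretrain checkpoint on the standard Word F0.5 metric (0.235 → 0.264). Both 1.3B NLLB checkpoints lag well behind the 50M mT5 baseline (Word F0.5 0.520, Identity 100% vs. 61.5%) due to language-switching on non-Kazakh tokens.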
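The card does not include the scoring script, so the following is only a minimal sketch of how the columns in the table above are commonly computed. It assumes Word F0.5 is the usual F-beta combination of word-level precision and recall, CER is character-level edit distance divided by reference length, and Exact Match is the share of outputs identical to the reference. All function names here are hypothetical.

```python
from typing import Sequence


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate relative to the reference length (assumed definition)."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)


def exact_match(hypotheses: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of hypotheses that equal their reference exactly."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)
```

Under this assumption, `f_beta(0.494, 0.661)` and `f_beta(0.206, 0.543)` round to the reported 0.520 and 0.235. The third row comes out slightly below the reported 0.264, which is consistent with the published precision and recall being rounded.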
Inference
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-pretrain-v1"
# For GEC, source and target are both Kazakh (Cyrillic script).
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

def correct(text: str) -> str:
    # Force the decoder to start with the Kazakh language token.
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    # Inputs longer than 256 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)
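Because `correct` truncates its input at 256 tokens, anything past that limit is dropped. One possible workaround, sketched here on the assumption that correcting sentence by sentence is acceptable for your text, is to split on sentence-final punctuation first. `split_sentences`, `correct_long`, and the regular expression are hypothetical helpers, not part of this repository; the correction function is passed in as an argument so the sketch runs without loading the model.

```python
import re
from typing import Callable, List

# Split after ., !, ? or … when followed by whitespace (a rough heuristic).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the non-empty sentence-like chunks of `text`."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def correct_long(text: str, correct_fn: Callable[[str], str]) -> str:
    """Apply `correct_fn` to each sentence and re-join with single spaces."""
    return " ".join(correct_fn(s) for s in split_sentences(text))
```

With the model loaded as above, this would be called as `correct_long(long_text, correct)`. Note that the original whitespace between sentences is normalised to a single space.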
Fine-tuning from this checkpoint
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./my-kk-gec",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    bf16=True,
    predict_with_generate=True,
    save_only_model=True,
)
# Pass your tokenized dataset and this model to Seq2SeqTrainer
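The comment above leaves the dataset preparation and trainer wiring open. A minimal sketch of one way to do it follows. The column names `source` and `target` and both helper functions are assumptions made for illustration; the card does not specify the dataset schema or the author's training script.

```python
def make_preprocess(tokenizer, max_length=256,
                    source_col="source", target_col="target"):
    """Build a batched map function that tokenizes (noisy, corrected) pairs.

    `source_col` and `target_col` are hypothetical column names.
    """
    def preprocess(batch):
        # `text_target` makes the tokenizer encode the targets as `labels`,
        # using the tokenizer's tgt_lang (kaz_Cyrl when loaded as in Inference).
        return tokenizer(batch[source_col], text_target=batch[target_col],
                         max_length=max_length, truncation=True)
    return preprocess


def build_trainer(model, tokenizer, train_dataset, training_args):
    """Wire a tokenized dataset and the model into a Seq2SeqTrainer."""
    # Imported lazily so make_preprocess can be used on its own.
    from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

    # The collator pads inputs and labels dynamically per batch.
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    return Seq2SeqTrainer(model=model, args=training_args,
                          train_dataset=train_dataset, data_collator=collator)
```

With a `datasets.Dataset` named `raw`, usage would look like `tokenized = raw.map(make_preprocess(tokenizer), batched=True, remove_columns=raw.column_names)` followed by `build_trainer(model, tokenizer, tokenized, training_args).train()`. With the arguments shown above, the effective batch size is 16 × 4 = 64 sequences per device.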
Related
- GEC fine-tuned: stukenov/sozkz-nllb-1b-kk-gec-v1
- Recommended baseline: saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1
- Pretrain dataset: stukenov/sozkz-corpus-pretrain-gec-mix-v1
- GEC benchmark: stukenov/sozkz-corpus-gec-benchmark-kk-v1
Benchmark Results
Evaluated on a 100-example custom GEC test set (pure model inference, no pre- or post-processing pipeline).
| Category | Score |
|---|---|
| Spelling | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 1/100 (1%) |
Leaderboard (100-example custom benchmark)
| Model | Total | Spelling/30 | Grammar/20 | Punctuation/15 | Mixed/20 | Identity/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
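The Total column is consistent with summing the per-category correct counts over the 100 benchmark examples (30 + 20 + 15 + 20 + 15). As a small worked example, assuming that is how the total is derived, the following sketch recomputes it for two leaderboard rows. The dictionary keys are English labels chosen for this sketch.

```python
# Number of examples per category in the 100-example custom benchmark.
CATEGORY_SIZES = {
    "spelling": 30,
    "grammar": 20,
    "punctuation": 15,
    "mixed": 20,
    "identity": 15,
}


def total_percent(correct: dict) -> float:
    """Overall score in percent from per-category correct counts."""
    for name, n in correct.items():
        if not 0 <= n <= CATEGORY_SIZES[name]:
            raise ValueError(f"{name}: {n} is outside 0..{CATEGORY_SIZES[name]}")
    return 100.0 * sum(correct.values()) / sum(CATEGORY_SIZES.values())


# Counts copied from the leaderboard rows.
llama_600m = {"spelling": 15, "grammar": 12, "punctuation": 3, "mixed": 2, "identity": 15}
nllb_pretrain = {"spelling": 0, "grammar": 1, "punctuation": 0, "mixed": 0, "identity": 0}

print(total_percent(llama_600m))     # 47.0
print(total_percent(nllb_pretrain))  # 1.0
```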