sozkz-nllb-1b-kk-pretrain-v1

NLLB-200-1.3B pretrained on a mixed GEC + translation dataset for Kazakh. Use this as a starting point for further fine-tuning on Kazakh GEC or kk↔en/ru translation tasks.

Training

Dataset	Pairs	Epochs	LR	Hardware
sozkz-corpus-pretrain-gec-mix-v1	1.77M	2	2e-5	1× H100 SXM 80GB, bf16

The pretrain mix contains:

Kazakh GEC pairs (noisy → corrected)
kk↔en and kk↔ru translation pairs

Evaluation

Evaluated on the canonical 200-example Kazakh GEC test set (stukenov/sozkz-corpus-gec-benchmark-kk-v1, split test) using beam search with 5 beams, with both source and target language set to kaz_Cyrl.

Model	Exact Match	CER ↓	Word Prec	Word Rec	Word F0.5 ↑	Identity
sozkz-fix-mt5-50m-kk-gec-v1 (baseline)	62.0%	0.0802	0.494	0.661	0.520	100%
sozkz-nllb-1b-kk-pretrain-v1 (this model)	43.5%	0.2643	0.206	0.543	0.235	61.5%
sozkz-nllb-1b-kk-gec-v1	44.0%	0.2447	0.233	0.550	0.264	61.5%

Stage-2 fine-tuning on 19K curated GEC examples (sozkz-nllb-1b-kk-gec-v1) yields only a marginal improvement over this pretrain checkpoint on the standard Word F0.5 metric (0.235 → 0.264). Both 1.3B NLLB models lag well behind the 50M mT5 baseline (0.520) due to language-switching on non-Kazakh tokens.
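
For readers who want to sanity-check the table, here is a minimal sketch of two of the metrics. It assumes CER is character-level Levenshtein distance divided by reference length and that Word F0.5 is the usual F-beta of word precision and recall; the exact normalization and word-alignment used for this card are not documented here, so treat the helper names (levenshtein, cer, f_beta) as illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance normalized by reference length."""
    if not reference:
        return float(bool(hypothesis))
    return levenshtein(hypothesis, reference) / len(reference)


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta=0.5 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Applied to the rounded precision and recall in the table, f_beta(0.494, 0.661) and f_beta(0.206, 0.543) round to the reported 0.520 and 0.235, which is consistent with the standard F0.5 definition.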

Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-pretrain-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

def correct(text: str) -> str:
    """Return the model's corrected version of a Kazakh sentence."""
    # NLLB selects the output language via the first decoder token.
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

Fine-tuning from this checkpoint

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./my-kk-gec",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    bf16=True,
    predict_with_generate=True,
    save_only_model=True,
)
# Pass your tokenized dataset and this model to Seq2SeqTrainer
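
The training arguments above leave dataset preparation open. Below is a minimal sketch of one way to tokenize (noisy, corrected) pairs. It assumes hypothetical column names source and target, and a tokenizer created with src_lang and tgt_lang set to kaz_Cyrl as in the Inference section. make_preprocess is an illustrative helper, not part of transformers.

```python
def make_preprocess(tokenizer, max_length=256,
                    source_column="source", target_column="target"):
    """Build a batched map function for a Hugging Face datasets.Dataset.

    Column names are hypothetical; adjust them to your data.
    """
    def preprocess(batch):
        # text_target makes the tokenizer encode the corrected side as
        # labels, using the tokenizer's target-language settings.
        return tokenizer(
            batch[source_column],
            text_target=batch[target_column],
            max_length=max_length,
            truncation=True,
        )
    return preprocess
```

With a datasets.Dataset named raw (hypothetical), raw.map(make_preprocess(tokenizer), batched=True, remove_columns=raw.column_names) would produce inputs suitable for Seq2SeqTrainer, and DataCollatorForSeq2Seq(tokenizer, model=model) would then pad each batch.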

GEC fine-tuned: stukenov/sozkz-nllb-1b-kk-gec-v1
Recommended baseline: saken-tukenov/sozkz-fix-mt5-50m-kk-gec-v1
Pretrain dataset: stukenov/sozkz-corpus-pretrain-gec-mix-v1
GEC benchmark: stukenov/sozkz-corpus-gec-benchmark-kk-v1

Benchmark Results

Evaluated on a 100-example custom GEC test using pure model inference, with no pre- or post-processing pipeline.

Category	Score
Spelling (емле)	0/30 (0%)
Grammar	1/20 (5%)
Punctuation	0/15 (0%)
Mixed	0/20 (0%)
Identity preservation	0/15 (0%)
Total	1/100 (1%)
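
A per-category table like the one above can be produced with a small scorer. The sketch below assumes exact-match scoring after stripping surrounding whitespace, and assumes that identity-preservation examples are simply pairs whose expected output equals the input; the actual benchmark harness is not published in this card, so score_by_category and the tuple layout are hypothetical.

```python
from collections import Counter


def score_by_category(examples, correct_fn):
    """Count exact matches per category.

    examples: iterable of (category, source, expected) tuples (assumed layout).
    correct_fn: callable mapping a source sentence to the model output,
                e.g. the correct() function from the Inference section.
    Returns {category: (hits, total)}.
    """
    totals, hits = Counter(), Counter()
    for category, source, expected in examples:
        totals[category] += 1
        # Assumed scoring rule: exact string match after whitespace stripping.
        if correct_fn(source).strip() == expected.strip():
            hits[category] += 1
    return {category: (hits[category], totals[category]) for category in totals}
```

Passing a function that returns its input unchanged gives a useful reference point: it scores full marks on identity preservation and zero on every category that requires an actual correction.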

Leaderboard (100-example custom benchmark)

Model	Total	Spelling/30	Grammar/20	Punct/15	Mixed/20	Identity/15
sozkz-core-llama-600m-kk-gec-v1	47%	15	12	3	2	15/15
sozkz-fix-qwen-500m-kk-gec-v3	38%	0	16	9	0	13/15
sozkz-core-llama-300m-kk-gec-v4	37%	9	6	4	3	15/15
sozkz-fix-qwen-500m-kk-gec-v1	35%	0	12	8	0	15/15
sozkz-fix-qwen-500m-kk-gec-v2	30%	0	11	7	0	12/15
sozkz-core-llama-1b-kk-gec-v1	16%	2	6	1	0	7/15
sozkz-fix-qwen-500m-kk-gec-v4	5%	0	1	4	0	0/15
sozkz-fix-mt5b-kk-gec-run13-v1	5%	0	2	0	0	3/15
sozkz-nllb-1b-kk-gec-v1	1%	0	1	0	0	0/15
sozkz-nllb-1b-kk-pretrain-v1	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v3	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v1/v2a/v2b	0–1%	0	0	0	0	0–1/15
sozkz-fix-mt5-50m-kk-gec-v1	0%	0	0	0	0	0/15
