sozkz-nllb-1b-kk-pretrain-v1

NLLB-200-1.3B pretrained on a mixed GEC + translation dataset for Kazakh. Use this as a starting point for further fine-tuning on Kazakh GEC or kk↔en/ru translation tasks.

Training

| Dataset | Pairs | Epochs | LR | Hardware |
|---|---|---|---|---|
| sozkz-corpus-pretrain-gec-mix-v1 | 1.77M | 2 | 2e-5 | 1× H100 SXM 80GB, bf16 |

The pretrain mix contains:

  • Kazakh GEC pairs (noisy → corrected)
  • kk↔en and kk↔ru translation pairs
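A minimal sketch of how the two kinds of pairs could be represented side by side, assuming GEC is framed as Kazakh-to-Kazakh translation (as in the evaluation and inference settings on this card). The `Pair` class, its field names, and the helper functions are hypothetical illustrations, not the actual schema of the pretrain dataset; the language codes are the standard NLLB ones.

```python
from dataclasses import dataclass

# Language codes as used by the NLLB tokenizer.
KK, EN, RU = "kaz_Cyrl", "eng_Latn", "rus_Cyrl"


@dataclass(frozen=True)
class Pair:
    """One training pair. Field names are hypothetical, not the dataset schema."""
    source: str
    target: str
    src_lang: str
    tgt_lang: str


def gec_pair(noisy: str, corrected: str) -> Pair:
    # Assumption: GEC is treated as Kazakh -> Kazakh "translation".
    return Pair(noisy, corrected, KK, KK)


def translation_pair(source: str, target: str, src_lang: str, tgt_lang: str) -> Pair:
    # The mix only contains kk<->en and kk<->ru, so exactly one side is Kazakh.
    allowed = {KK, EN, RU}
    if src_lang not in allowed or tgt_lang not in allowed:
        raise ValueError("unsupported language code")
    if src_lang == tgt_lang or KK not in (src_lang, tgt_lang):
        raise ValueError("expected a kk<->en or kk<->ru pair")
    return Pair(source, target, src_lang, tgt_lang)


mix = [
    gec_pair("Мен мектепке барамн", "Мен мектепке барамын"),
    translation_pair("Мен мектепке барамын", "I go to school", KK, EN),
]
```

Keeping the language codes on each pair makes it straightforward to set the tokenizer's source and target language per example when building batches.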

Evaluation

Evaluated on the canonical 200-example Kazakh GEC test set (stukenov/sozkz-corpus-gec-benchmark-kk-v1, split test) using beam search with 5 beams and kaz_Cyrl as both source and target language (kaz_Cyrl → kaz_Cyrl).

| Model | Exact Match | CER ↓ | Word Prec | Word Rec | Word F0.5 ↑ | Identity |
|---|---|---|---|---|---|---|
| sozkz-fix-mt5-50m-kk-gec-v1 (baseline) | 62.0% | 0.0802 | 0.494 | 0.661 | 0.520 | 100% |
| sozkz-nllb-1b-kk-pretrain-v1 (this model) | 43.5% | 0.2643 | 0.206 | 0.543 | 0.235 | 61.5% |
| sozkz-nllb-1b-kk-gec-v1 | 44.0% | 0.2447 | 0.233 | 0.550 | 0.264 | 61.5% |

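Fine-tuning on 19K curated GEC examples (stage 2, the sozkz-nllb-1b-kk-gec-v1 row) yields only a marginal improvement over this pretrain checkpoint: Word F0.5 rises from 0.235 to 0.264 and CER falls from 0.2643 to 0.2447. Both 1.3B NLLB models lag well behind the 50M mT5 baseline (Word F0.5 0.520, CER 0.0802); the gap is attributed to language-switching on non-Kazakh tokens.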
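For readers who want to reproduce the headline numbers, here is a minimal sketch of two of the metrics. It assumes Word F0.5 is the usual F-beta combination of word precision and recall with beta = 0.5, and that CER is character-level edit distance divided by reference length. The benchmark's exact definitions (including how word precision and recall are counted) are not stated on this card, so treat both as assumptions.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed CER definition: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


# Recompute Word F0.5 from the precision/recall columns of the table.
baseline_f05 = round(f_beta(0.494, 0.661), 3)
pretrain_f05 = round(f_beta(0.206, 0.543), 3)
```

Under these assumptions the precision and recall columns reproduce the reported Word F0.5 for the baseline and for this model. For sozkz-nllb-1b-kk-gec-v1 the same formula gives roughly 0.263 against the reported 0.264, a gap that is consistent with rounding in the published precision and recall.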

Inference

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "stukenov/sozkz-nllb-1b-kk-pretrain-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="kaz_Cyrl", tgt_lang="kaz_Cyrl")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

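def correct(text: str) -> str:
    """Return the model's corrected version of a Kazakh sentence."""
    # Force decoding to start with the Kazakh language token (kk → kk).
    forced_bos = tokenizer.convert_tokens_to_ids("kaz_Cyrl")
    # Inputs longer than 256 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", max_length=256,
                       truncation=True).to(model.device)
    with torch.no_grad():
        # Beam search with 5 beams, matching the evaluation setup.
        out = model.generate(**inputs, forced_bos_token_id=forced_bos,
                             num_beams=5, max_length=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)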
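Because this checkpoint can drift away from the input (the evaluation notes language-switching on non-Kazakh tokens), a caller may want a simple safety guard around a function like `correct`. The sketch below is one possible approach and not part of the model or its evaluation: `guarded_correct` and the `min_similarity` threshold of 0.6 are hypothetical, and the similarity measure is Python's standard `difflib.SequenceMatcher`. It keeps the model output only when it stays close to the input and otherwise returns the input unchanged.

```python
from difflib import SequenceMatcher
from typing import Callable


def guarded_correct(text: str,
                    correct_fn: Callable[[str], str],
                    min_similarity: float = 0.6) -> str:
    """Apply correct_fn, but fall back to the input if the output drifts too far.

    min_similarity is a hypothetical threshold; tune it on held-out data.
    """
    candidate = correct_fn(text)
    # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
    similarity = SequenceMatcher(None, text, candidate).ratio()
    return candidate if similarity >= min_similarity else text
```

A genuine correction usually changes only a few characters, so it passes the check, while an output that has been translated into another language shares little with the input and is rejected. Usage would look like `guarded_correct(sentence, correct)`.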

Fine-tuning from this checkpoint

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

training_args = Seq2SeqTrainingArguments(
    output_dir="./my-kk-gec",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    bf16=True,
    predict_with_generate=True,
    save_only_model=True,
)
# Pass your tokenized dataset and this model to Seq2SeqTrainer
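The last step can be sketched as follows, assuming a dataset with `source` and `target` text columns (hypothetical column names) and a tokenizer loaded with `src_lang` and `tgt_lang` set as in the inference snippet. `preprocess` and `build_trainer` are illustrative helpers, not part of this repository; the `transformers` import is kept inside `build_trainer` so the preprocessing function can be tried without the library installed.

```python
def preprocess(batch, tokenizer, max_length=256):
    """Tokenize a batch of pairs; 'source'/'target' are hypothetical column names."""
    # text_target makes the tokenizer also produce the `labels` field.
    return tokenizer(batch["source"], text_target=batch["target"],
                     max_length=max_length, truncation=True)


def build_trainer(model, tokenizer, train_dataset, training_args):
    """Wire the model, tokenized data, and arguments into a Seq2SeqTrainer."""
    from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

    # Pads inputs and labels dynamically per batch.
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    return Seq2SeqTrainer(model=model, args=training_args,
                          train_dataset=train_dataset, data_collator=collator)
```

With a Hugging Face `datasets` dataset, `preprocess` would typically be applied through `dataset.map(preprocess, batched=True, fn_kwargs={"tokenizer": tokenizer})` before calling `build_trainer(...).train()`.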

Benchmark Results

Evaluated on a 100-example custom GEC test set (pure model inference, with no pre- or post-processing pipeline).

| Category | Score |
|---|---|
| Spelling (емле) | 0/30 (0%) |
| Grammar | 1/20 (5%) |
| Punctuation | 0/15 (0%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 0/15 (0%) |
| Total | 1/100 (1%) |

Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct/15 | Mixed/20 | Identity/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
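The Total column is consistent with the sum of the five category counts over the 100 examples. A short sketch of that tally follows; the dictionary keys and the `total_percent` helper are illustrative names, and how an individual example is judged correct is not specified on this card.

```python
# Number of examples per category in the 100-example custom benchmark.
CATEGORY_SIZES = {
    "spelling": 30,
    "grammar": 20,
    "punctuation": 15,
    "mixed": 20,
    "identity": 15,
}


def total_percent(correct_by_category: dict) -> float:
    """Overall score: correct examples summed over categories, as a percentage."""
    for name, n_correct in correct_by_category.items():
        if not 0 <= n_correct <= CATEGORY_SIZES[name]:
            raise ValueError(f"count out of range for category {name!r}")
    return 100.0 * sum(correct_by_category.values()) / sum(CATEGORY_SIZES.values())


# Counts taken from the leaderboard rows.
llama_600m = total_percent({"spelling": 15, "grammar": 12, "punctuation": 3,
                            "mixed": 2, "identity": 15})
this_model = total_percent({"spelling": 0, "grammar": 1, "punctuation": 0,
                            "mixed": 0, "identity": 0})
```

Because the categories have different sizes, the total is dominated by spelling (30 examples); a model can rank high overall while scoring zero in an entire category, as several rows show.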