language:
  - kk
license: mit
tags:
  - gec
  - grammar-correction
  - kazakh
  - llama
datasets:
  - stukenov/sozkz-corpus-synthetic-kk-gec-v1
base_model: stukenov/sozkz-core-llama-1b-kk-base-v1
pipeline_tag: text-generation

sozkz-core-llama-1b-kk-gec-v1

A Kazakh grammatical error correction (GEC) model based on the Llama 1B architecture.

Model Details

Format

The model uses a simple two-line format: the input goes on the first line, and the model generates the correction on the second line:

{input text}
{corrected text}
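The two-line format can be wrapped in a pair of small helpers. This is a minimal sketch, not part of the released code: the names `build_prompt` and `extract_correction` are hypothetical, and both collapsing whitespace in the input and keeping only the first generated line are assumptions about how to keep the exchange to exactly two lines.

```python
def build_prompt(text: str) -> str:
    """Put the input on a single line and end it with a newline."""
    # Assumption: internal newlines and repeated spaces are collapsed so the
    # input occupies exactly one line, as the two-line format expects.
    one_line = " ".join(text.split())
    return one_line + "\n"


def extract_correction(generated: str) -> str:
    """Return the first non-empty line of the generated continuation."""
    # Assumption: anything after the first generated line is ignored.
    for line in generated.splitlines():
        if line.strip():
            return line.strip()
    return ""


prompt = build_prompt("Ол  мектепке\nбардым.")
correction = extract_correction("Ол мектепке барды.\n")
```

Here `prompt` is what would be sent to the model, and `extract_correction` would be applied to the decoded continuation.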

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")

if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

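def correct(text):
    """Return the model's correction for a single line of Kazakh text."""
    # The model expects the input on one line, followed by a newline.
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc, max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding, so the output is deterministic
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()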

print(correct("Ол мектепке бардым."))
# Ол мектепке барды.
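To see which words the model changed, the output can be compared with the input word by word. The following is a minimal sketch using the standard-library `difflib`; the helper name `word_changes` is hypothetical and whitespace tokenization is an assumption, so punctuation stays attached to words.

```python
from difflib import SequenceMatcher


def word_changes(original: str, corrected: str) -> list[tuple[str, str]]:
    """List (before, after) word spans that differ between two sentences."""
    a, b = original.split(), corrected.split()
    changes = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            # Insertions have an empty "before"; deletions an empty "after".
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


changes = word_changes("Ол мектепке бардым.", "Ол мектепке барды.")
print(changes)
```

An empty list means the model left the input unchanged, which is the expected behavior on clean text.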

Results

| Metric | Value |
|---|---|
| CER | 0.019 |
| Word Precision | 0.704 |
| Word Recall | 0.575 |
| Word F0.5 | 0.673 |
| Identity Preservation | 97.2% |
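The Word F0.5 score weights precision more heavily than recall. As a quick consistency check, the standard F-beta formula can be applied to the reported precision and recall. This is only a sketch of the general formula, not the evaluation script used for this model; the function name `f_beta` is hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported word-level precision and recall from the table above.
score = f_beta(0.704, 0.575)
print(round(score, 3))
```

With the rounded inputs this gives roughly 0.674, in line with the reported 0.673; the small gap is consistent with the table values being rounded.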

Strengths:

  • Very high identity preservation (97.2%): clean text is rarely altered
  • Good precision on corrections it makes (70.4%)

Limitations:

  • Conservative: tends to leave text unchanged when unsure
  • Recall is moderate (57.5%), so some errors are missed
  • Trained on synthetic data, so it may not cover all real-world error patterns

Training

| Parameter | Value |
|---|---|
| Architecture | Llama 1.08B |
| Method | Full fine-tune |
| Learning rate | 1e-5 |
| Epochs | 3 |
| Effective batch size | 128 |
| Max sequence length | 512 |
| Precision | bf16 |
| Clean ratio | 80% |
| Hardware | 1x NVIDIA H100 80GB |
| Training time | 57 minutes |
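An effective batch size is usually the product of the per-device batch size, the number of gradient accumulation steps, and the number of devices. The card does not say how 128 was split, so the values 16 and 8 below are purely hypothetical; only the total of 128, the single GPU, and the 512-token limit come from the table.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of sequences contributing to one optimizer step."""
    return per_device * grad_accum * num_devices


# Hypothetical split on the single H100: 16 sequences x 8 accumulation steps.
batch = effective_batch_size(per_device=16, grad_accum=8, num_devices=1)

# Upper bound on tokens per optimizer step at the 512-token sequence limit.
max_tokens_per_step = batch * 512
print(batch, max_tokens_per_step)
```

Any other split with the same product would yield the same effective batch size.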

License

MIT

Benchmark Results

Evaluated on a custom 100-example GEC test set (pure model inference, with no pre- or post-processing pipeline).

| Category | Score |
|---|---|
| Spelling (emle) | 2/30 (7%) |
| Grammar | 6/20 (30%) |
| Punctuation | 1/15 (7%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 7/15 (47%) |
| Total | 16/100 (16%) |
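The total is the sum of the per-category counts. A short sketch reproduces the percentages from the raw counts above; the dictionary keys are English labels for the benchmark categories, and the variable names are illustrative only.

```python
# (correct, total) per category, copied from the table above.
scores = {
    "Spelling": (2, 30),
    "Grammar": (6, 20),
    "Punctuation": (1, 15),
    "Mixed": (0, 20),
    "Identity preservation": (7, 15),
}

# Percentage per category, rounded to the nearest integer.
percents = {name: round(100 * c / t) for name, (c, t) in scores.items()}

total_correct = sum(c for c, _ in scores.values())
total_items = sum(t for _, t in scores.values())

for name, (c, t) in scores.items():
    print(f"{name}: {c}/{t} ({percents[name]}%)")
print(f"Total: {total_correct}/{total_items}")
```

The same aggregation applies to each row of the leaderboard, where the total percentage equals the sum of the five category counts because the benchmark has exactly 100 examples.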

Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punctuation/15 | Mixed/20 | Identity/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1/15 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |