language:
  - kk
license: mit
tags:
  - gec
  - grammar-correction
  - kazakh
  - llama
datasets:
  - stukenov/sozkz-corpus-synthetic-kk-gec-v1
base_model: stukenov/sozkz-core-llama-1b-kk-base-v1
pipeline_tag: text-generation

sozkz-core-llama-1b-kk-gec-v1

A Kazakh grammatical error correction (GEC) model based on the Llama 1B architecture.

Model Details

Base model: stukenov/sozkz-core-llama-1b-kk-base-v1
Method: Full fine-tune (all 1.08B parameters)
Dataset: stukenov/sozkz-corpus-synthetic-kk-gec-v1 (216K error pairs + 216K clean examples)
Training: 3 epochs, LR=1e-5, cosine scheduler, bf16, BS=64, 1xH100 80GB, 57 min
Final loss: 0.118

Format

The model uses a simple two-line format: the input text goes on the first line, and the model generates the corrected text on the second line.

{input text}
{corrected text}
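
The two-line format above can be handled with a pair of small helpers. This is a minimal sketch, not part of the released code: the names `build_prompt` and `extract_correction` are hypothetical, and flattening multi-line input and keeping only the first generated line are assumptions about how the format is best applied.

```python
def build_prompt(text: str) -> str:
    """Return the first line of the two-line format, terminated by a newline."""
    # Assumption: the input must occupy exactly one line, so any internal
    # line breaks are replaced by single spaces.
    single_line = " ".join(part.strip() for part in text.splitlines() if part.strip())
    return single_line + "\n"


def extract_correction(generated: str) -> str:
    """Keep only the first non-empty line of the generated continuation."""
    # Assumption: the correction is a single line; anything the model
    # writes after it is discarded.
    stripped = generated.strip()
    return stripped.splitlines()[0].strip() if stripped else ""


prompt = build_prompt("Ол мектепке бардым.")
print(repr(prompt))
print(extract_correction("Ол мектепке барды.\n"))
```

Trimming to the first line is only a safeguard for cases where generation continues past the correction; if the model stops at the end-of-sequence token, it has no effect.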

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")

if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

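def correct(text):
    """Return the model's correction for a single line of Kazakh text."""
    # The prompt is the input followed by a newline; the model writes
    # the correction on the next line.
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding, so the output is deterministic
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()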

print(correct("Ол мектепке бардым."))
# Ол мектепке барды.

Results

Metric	Value
CER	0.019
Word Precision	0.704
Word Recall	0.575
Word F0.5	0.673
Identity Preservation	97.2%
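
The table does not say how the metrics are computed, so the sketch below uses the common definitions as an assumption: CER as character-level Levenshtein distance divided by the reference length, and F0.5 as the weighted harmonic mean of precision and recall with beta = 0.5. The function names are illustrative and are not taken from the evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Recompute F0.5 from the rounded precision and recall in the table.
print(round(f_beta(0.704, 0.575), 3))
```

With the rounded table values this gives roughly 0.674, which agrees with the reported 0.673 to within rounding of the inputs.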

Strengths:

Very high identity preservation (97.2%) -- does not corrupt clean text
Good precision on corrections it makes (70.4%)

Limitations:

Conservative -- prefers not to change text when unsure
Recall is moderate (57.5%) -- misses some errors
Trained on synthetic data -- may not cover all real-world error patterns

Training

Parameter	Value
Architecture	Llama 1.08B
Method	Full fine-tune
Learning rate	1e-5
Epochs	3
Effective batch size	128
Max sequence length	512
Precision	bf16
Clean ratio	80%
Hardware	1x NVIDIA H100 80GB
Training time	57 minutes
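
For readers unfamiliar with the cosine scheduler listed above, the following sketch shows the shape of the decay from the peak learning rate of 1e-5. It assumes no warmup and decay to zero, neither of which is stated in this card, and `total_steps` is a hypothetical value chosen only for illustration.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 1e-5, min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at a given optimizer step."""
    # Clamp progress to [0, 1] so steps outside the schedule stay in range.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count is not given in this card
for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate starts at the peak, reaches half of it at the midpoint of training, and approaches the minimum at the final step.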

License

MIT

Benchmark Results

Evaluated on a custom 100-example GEC test set (pure model inference, with no pre- or post-processing pipeline).

Category	Score
Spelling (емле)	2/30 (7%)
Grammar	6/20 (30%)
Punctuation	1/15 (7%)
Mixed	0/20 (0%)
Identity preservation	7/15 (47%)
Total	16/100 (16%)
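
The total is the sum of the per-category counts, which the short sketch below reproduces from the table. The category names are given in English, and the helper names are illustrative. The card does not state how a single example is judged correct, so only the aggregation is shown.

```python
# (correct, total) per category, copied from the table above.
category_scores = {
    "Spelling": (2, 30),
    "Grammar": (6, 20),
    "Punctuation": (1, 15),
    "Mixed": (0, 20),
    "Identity preservation": (7, 15),
}


def summarize(scores):
    """Return (correct, total, accuracy) aggregated over all categories."""
    correct = sum(c for c, _ in scores.values())
    total = sum(t for _, t in scores.values())
    return correct, total, correct / total


def per_category_pct(scores):
    """Return the percentage of correct examples per category, rounded."""
    return {name: round(100 * c / t) for name, (c, t) in scores.items()}


correct, total, accuracy = summarize(category_scores)
print(f"{correct}/{total} ({accuracy:.0%})")
print(per_category_pct(category_scores))
```

Because the categories have different sizes (15 to 30 examples), the total is a count-weighted score and not an average of the per-category percentages.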

Leaderboard (100-example custom benchmark)

Model	Total	Spelling/30	Grammar/20	Punctuation/15	Mixed/20	Identity/15
sozkz-core-llama-600m-kk-gec-v1	47%	15	12	3	2	15/15
sozkz-fix-qwen-500m-kk-gec-v3	38%	0	16	9	0	13/15
sozkz-core-llama-300m-kk-gec-v4	37%	9	6	4	3	15/15
sozkz-fix-qwen-500m-kk-gec-v1	35%	0	12	8	0	15/15
sozkz-fix-qwen-500m-kk-gec-v2	30%	0	11	7	0	12/15
sozkz-core-llama-1b-kk-gec-v1	16%	2	6	1	0	7/15
sozkz-fix-qwen-500m-kk-gec-v4	5%	0	1	4	0	0/15
sozkz-fix-mt5b-kk-gec-run13-v1	5%	0	2	0	0	3/15
sozkz-nllb-1b-kk-gec-v1	1%	0	1	0	0	0/15
sozkz-nllb-1b-kk-pretrain-v1	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v3	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v1/v2a/v2b	0–1%	0	0	0	0	0–1/15
sozkz-fix-mt5-50m-kk-gec-v1	0%	0	0	0	0	0/15