stukenov/sozkz-corpus-synthetic-kk-gec-v1
Kazakh grammatical error correction (GEC) model based on the Llama 1B architecture.
The model uses a simple two-line format: the input text goes on the first line, and the model generates the correction on the second line:
{input text}
{corrected text}
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")

# Make sure EOS and PAD tokens exist so generation can stop and pad cleanly.
if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def correct(text):
    # The prompt is the input line plus a newline; the model writes the correction next.
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc, max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding
        )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()

print(correct("Ол мектепке бардым."))
# Ол мектепке барды.
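The two-line format can also be handled with a pair of small string helpers that do not need the model loaded. This is a minimal sketch: `build_prompt` and `extract_correction` are hypothetical names, and keeping only the first non-empty generated line is an assumption for the case where the model keeps writing past the correction, which the card does not address.

```python
def build_prompt(text: str) -> str:
    """Input on the first line; the model is expected to continue on the second."""
    # Assumption: drop trailing newlines so the prompt ends with exactly one "\n".
    return text.rstrip("\n") + "\n"


def extract_correction(generated: str) -> str:
    """Return the first non-empty line of the generated continuation."""
    for line in generated.splitlines():
        if line.strip():
            return line.strip()
    return ""  # nothing usable was generated


prompt = build_prompt("Ол мектепке бардым.")
correction = extract_correction("Ол мектепке барды.\nextra text")
```

With these helpers, the decoded continuation can be passed through `extract_correction` before it is shown to a user.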
| Metric | Value |
|---|---|
| CER | 0.019 |
| Word Precision | 0.704 |
| Word Recall | 0.575 |
| Word F0.5 | 0.673 |
| Identity Preservation | 97.2% |
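For readers unfamiliar with these metrics, the sketch below shows the standard definitions of F-beta and character error rate. The card does not say how its numbers were computed, so the exact tokenization and normalization are unknown; `f_beta` and `cer` are illustrative names, and normalizing CER by the reference length is an assumption.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta=0.5 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def cer(hypothesis: str, reference: str) -> float:
    """Character-level Levenshtein distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


f05 = f_beta(0.704, 0.575)  # rounded precision and recall from the table
```

Plugging in the rounded precision and recall from the table gives roughly 0.674, which agrees with the reported 0.673 up to rounding of the inputs.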
Strengths:
Limitations:
| Parameter | Value |
|---|---|
| Architecture | Llama 1.08B |
| Method | Full fine-tune |
| Learning rate | 1e-5 |
| Epochs | 3 |
| Effective batch size | 128 |
| Max sequence length | 512 |
| Precision | bf16 |
| Clean ratio | 80% |
| Hardware | 1x NVIDIA H100 80GB |
| Training time | 57 minutes |
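The effective batch size is conventionally the product of the per-device batch size, the number of gradient accumulation steps, and the number of devices. The card gives only the product (128) and the device count (1); the split used below is a hypothetical example, not the author's actual setting.

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


# Hypothetical split: 16 examples per step, accumulated over 8 steps, on 1 GPU.
example = effective_batch_size(per_device_batch=16, grad_accum_steps=8, num_devices=1)
```

Any other split with the same product, such as 32 x 4, would match the table equally well.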
MIT
Evaluated on a custom 100-example GEC test set (raw model inference, with no pre- or post-processing pipeline).
| Category | Score |
|---|---|
| Spelling | 2/30 (7%) |
| Grammar | 6/20 (30%) |
| Punctuation | 1/15 (7%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 7/15 (47%) |
| Total | 16/100 (16%) |
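The per-category percentages and the total can be re-derived from the raw counts. The sketch below is only a consistency check on the table: the dictionary keys are English labels for the table's categories, and `percent` and `summarize` are hypothetical helper names.

```python
# (correct, total) per category, copied from the table above.
SCORES = {
    "Spelling": (2, 30),
    "Grammar": (6, 20),
    "Punctuation": (1, 15),
    "Mixed": (0, 20),
    "Identity preservation": (7, 15),
}


def percent(correct: int, total: int) -> int:
    """Share of correct answers as a whole-number percentage."""
    return round(100 * correct / total)


def summarize(scores: dict) -> tuple:
    """Return (correct, total, percent) summed over all categories."""
    correct = sum(c for c, _ in scores.values())
    total = sum(n for _, n in scores.values())
    return correct, total, percent(correct, total)


per_category = {name: percent(c, n) for name, (c, n) in SCORES.items()}
overall = summarize(SCORES)
```

The category counts sum to 16 of 100, matching the Total row.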
| Model | Total | Spelling/30 | Grammar/20 | Punct./15 | Mixed/20 | Identity/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
Base model
stukenov/sozkz-core-llama-1b-kk-base-v1