sozkz-core-llama-1b-kk-gec-v1

A Kazakh grammatical error correction (GEC) model based on the Llama 1B architecture.

Model Details

Format

The model uses a simple two-line format: the input text goes on the first line, and the model generates the correction on the second line:

{input text}
{corrected text}
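The two-line format above can be sketched as a pair of small helpers. This is only an illustration: `build_prompt` and `extract_correction` are hypothetical names, and both collapsing line breaks in the input and keeping only the first generated line are assumptions drawn from the format description, not something the model card specifies.

```python
def build_prompt(text: str) -> str:
    """Put the input on a single line and end it with a newline."""
    # Assumption: line breaks inside the input would break the
    # two-line format, so runs of whitespace become single spaces.
    one_line = " ".join(text.split())
    return one_line + "\n"


def extract_correction(generated: str) -> str:
    """Return the first non-empty line of the generated text."""
    # Assumption: the correction is exactly one line, so anything the
    # model emits after it is ignored.
    for line in generated.splitlines():
        if line.strip():
            return line.strip()
    return ""
```

With these helpers, `build_prompt(text)` would be tokenized and passed to the model, and `extract_correction` applied to the decoded continuation.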

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
# Use the GPU when one is available; otherwise fall back to CPU (slow).
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers releases take torch_dtype= instead of dtype=.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to(device)

# Make sure generation has both an EOS and a padding token to work with.
if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def correct(text):
    # The prompt is the input line followed by a newline.
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(device)
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc, max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding for deterministic output
        )
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()

print(correct("Ол мектепке бардым."))
# Ол мектепке барды.
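To see which words a correction changed, the input and output can be compared word by word. The sketch below uses Python's standard `difflib`. `changed_words` is a hypothetical helper and is not part of the model or its repository. Splitting on whitespace is a simplifying assumption, so punctuation stays attached to words.

```python
import difflib


def changed_words(source: str, corrected: str):
    """Return (before, after) pairs for each word span that differs."""
    src, tgt = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions have an empty "before"; deletions an empty "after".
            changes.append((" ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return changes


print(changed_words("Ол мектепке бардым.", "Ол мектепке барды."))
# [('бардым.', 'барды.')]
```

An empty list means the model returned the input unchanged.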

Results

| Metric | Value |
|---|---|
| CER | 0.019 |
| Word Precision | 0.704 |
| Word Recall | 0.575 |
| Word F0.5 | 0.673 |
| Identity Preservation | 97.2% |
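The F0.5 score weights precision more heavily than recall. As a sanity check, the standard F-beta formula can be applied to the reported word precision and recall. The function name `f_beta` is illustrative, and the author's exact evaluation script is not shown in this card.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported word-level precision and recall from the table above.
score = f_beta(0.704, 0.575)
print(round(score, 3))
```

This yields roughly 0.674, consistent with the reported 0.673 given that the precision and recall shown here are themselves rounded.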

Strengths:

  • Very high identity preservation (97.2%) -- clean text is almost always left unchanged
  • Good precision on corrections it makes (70.4%)

Limitations:

  • Conservative -- prefers not to change text when unsure
  • Recall is moderate (57.5%) -- misses some errors
  • Trained on synthetic data -- may not cover all real-world error patterns
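For readers who want to reproduce the CER and identity preservation metrics on their own data, a minimal sketch follows. It assumes CER is the total character edit distance divided by the total reference length, and that identity preservation is the share of already-correct inputs returned unchanged. These are the usual definitions, but the card does not state which ones were used, and all function names here are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(hypotheses, references) -> float:
    """Character error rate over a corpus (assumed definition)."""
    errors = sum(edit_distance(h, r) for h, r in zip(hypotheses, references))
    total = sum(len(r) for r in references)
    return errors / total if total else 0.0


def identity_preservation(clean_inputs, outputs) -> float:
    """Share of clean inputs that come back exactly unchanged."""
    if not clean_inputs:
        return 0.0
    kept = sum(1 for s, o in zip(clean_inputs, outputs) if s == o)
    return kept / len(clean_inputs)
```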

Training

| Parameter | Value |
|---|---|
| Architecture | Llama 1.08B |
| Method | Full fine-tune |
| Learning rate | 1e-5 |
| Epochs | 3 |
| Effective batch size | 128 |
| Max sequence length | 512 |
| Precision | bf16 |
| Clean ratio | 80% |
| Hardware | 1x NVIDIA H100 80GB |
| Training time | 57 minutes |
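The effective batch size is the number of sequences that contribute to one optimizer step. On a single GPU it is usually reached by combining a per-device batch with gradient accumulation. The card reports only the product (128), so the 16 x 8 split below is purely hypothetical, as are the function names.

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Sequences contributing to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def max_tokens_per_step(effective_batch: int, max_seq_len: int) -> int:
    """Upper bound on tokens per optimizer step (all sequences full length)."""
    return effective_batch * max_seq_len


# Hypothetical split: 16 sequences per step, accumulated 8 times, on 1 GPU.
batch = effective_batch_size(16, 8, 1)
print(batch)                             # 128
print(max_tokens_per_step(batch, 512))   # 65536
```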

License

MIT

Benchmark Results

Evaluated on a custom 100-example GEC test set (pure model inference, no pre- or post-processing pipeline).

| Category | Score |
|---|---|
| Spelling | 2/30 (7%) |
| Grammar | 6/20 (30%) |
| Punctuation | 1/15 (7%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 7/15 (47%) |
| Total | 16/100 (16%) |
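The percentages and the total follow directly from the per-category counts. The short sketch below recomputes them, using English category labels. The `summarize` helper is illustrative only and is not part of any published evaluation code.

```python
# (correct, total) per category, taken from the benchmark table above.
scores = {
    "Spelling": (2, 30),
    "Grammar": (6, 20),
    "Punctuation": (1, 15),
    "Mixed": (0, 20),
    "Identity preservation": (7, 15),
}


def summarize(scores):
    """Return per-category percentages plus overall correct/total counts."""
    percents = {name: round(100 * c / n) for name, (c, n) in scores.items()}
    total_correct = sum(c for c, _ in scores.values())
    total_examples = sum(n for _, n in scores.values())
    return percents, total_correct, total_examples


percents, total_correct, total_examples = summarize(scores)
print(percents)
print(f"Total: {total_correct}/{total_examples}")
```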

Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punctuation/15 | Mixed/20 | Identity/15 |
|---|---|---|---|---|---|---|
| sozkz-core-llama-600m-kk-gec-v1 | 47% | 15 | 12 | 3 | 2 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v3 | 38% | 0 | 16 | 9 | 0 | 13/15 |
| sozkz-core-llama-300m-kk-gec-v4 | 37% | 9 | 6 | 4 | 3 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v1 | 35% | 0 | 12 | 8 | 0 | 15/15 |
| sozkz-fix-qwen-500m-kk-gec-v2 | 30% | 0 | 11 | 7 | 0 | 12/15 |
| sozkz-core-llama-1b-kk-gec-v1 | 16% | 2 | 6 | 1 | 0 | 7/15 |
| sozkz-fix-qwen-500m-kk-gec-v4 | 5% | 0 | 1 | 4 | 0 | 0/15 |
| sozkz-fix-mt5b-kk-gec-run13-v1 | 5% | 0 | 2 | 0 | 0 | 3/15 |
| sozkz-nllb-1b-kk-gec-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-nllb-1b-kk-pretrain-v1 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v3 | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1/15 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |