sozkz-core-llama-1b-kk-gec-v1

A Kazakh grammatical error correction (GEC) model based on the Llama 1B architecture.

Model Details

Base model: stukenov/sozkz-core-llama-1b-kk-base-v1
Method: Full fine-tune (all 1.08B parameters)
Dataset: stukenov/sozkz-corpus-synthetic-kk-gec-v1 (216K error pairs + 216K clean examples)
Training: 3 epochs, LR=1e-5, cosine scheduler, bf16, BS=64, 1xH100 80GB, 57 min
Final loss: 0.118

Format

The model uses a simple two-line format: the input text goes on the first line, and the model generates the correction on the second line:

{input text}
{corrected text}
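
The two-line format can be wrapped in a pair of small helpers. This is a minimal sketch, not part of the released code: `build_prompt` and `extract_correction` are hypothetical names, and joining multi-line input into one line is an assumption based on the newline acting as the separator between input and correction.

```python
def build_prompt(text: str) -> str:
    """Return the input as a single line followed by the separating newline."""
    # Assumption: "\n" separates input from correction, so any newlines
    # inside the input are joined with spaces before the prompt is built.
    parts = [part.strip() for part in text.splitlines() if part.strip()]
    return " ".join(parts) + "\n"


def extract_correction(generated: str) -> str:
    """Keep only the first non-empty line of the generated continuation."""
    for line in generated.splitlines():
        if line.strip():
            return line.strip()
    return ""


prompt = build_prompt("Ол мектепке бардым.")
correction = extract_correction("Ол мектепке барды.\nanything after the second line")
print(repr(prompt))
print(correction)
```

Cutting the output at the first line break guards against the model continuing past the correction if it does not emit an end-of-sequence token.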

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")

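# Fall back to a default EOS token if the tokenizer does not define one.
if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
# Reuse EOS for padding so generate() has a valid pad token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def correct(text):
    # The trailing newline cues the model to write the correction on the next line.
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc, max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding for deterministic output
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()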

print(correct("Ол мектепке бардым."))
# Ол мектепке барды.

Results

Metric	Value
CER	0.019
Word Precision	0.704
Word Recall	0.575
Word F0.5	0.673
Identity Preservation	97.2%
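
For readers who want to reproduce the metric definitions, here is a minimal sketch of the standard F-beta formula and a character error rate based on Levenshtein distance. The card does not say how the metrics were computed, so the function names and the normalization of CER by reference length are assumptions.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length (assumed)."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)


# Using the rounded precision and recall from the table above.
print(round(f_beta(0.704, 0.575), 3))
```

With the rounded table values this gives roughly 0.674, which is close to the reported 0.673. The small gap is likely due to rounding of the inputs.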

Strengths:

Very high identity preservation (97.2%) -- rarely alters text that is already correct
Good precision on corrections it makes (70.4%)

Limitations:

Conservative -- prefers not to change text when unsure
Recall is moderate (57.5%) -- misses some errors
Trained on synthetic data -- may not cover all real-world error patterns

Training

Parameter	Value
Architecture	Llama 1.08B
Method	Full fine-tune
Learning rate	1e-5
Epochs	3
Effective batch size	128
Max sequence length	512
Precision	bf16
Clean ratio	80%
Hardware	1x NVIDIA H100 80GB
Training time	57 minutes
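
The Model Details section lists BS=64, while this table lists an effective batch size of 128. One reading consistent with both is a per-device batch of 64 with 2 gradient accumulation steps on the single GPU. That is an assumption, since the card does not state the accumulation setting. The arithmetic is:

```python
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


# Assumed values: grad_accum_steps=2 is a guess that reconciles 64 with 128.
print(effective_batch_size(per_device_batch=64, grad_accum_steps=2, num_devices=1))
```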

License

MIT

Benchmark Results

Evaluated on a custom 100-example GEC test set (pure model inference, with no pre- or post-processing pipeline).

Category	Score
Spelling (emle)	2/30 (7%)
Grammar	6/20 (30%)
Punctuation	1/15 (7%)
Mixed	0/20 (0%)
Identity preservation	7/15 (47%)
Total	16/100 (16%)
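
As a quick check, the per-category counts can be tallied to recover the percentages and the total. This is a small sketch: the dictionary layout is illustrative, and the category names are given in English.

```python
# (correct, total) per category, copied from the table above.
scores = {
    "Spelling": (2, 30),
    "Grammar": (6, 20),
    "Punctuation": (1, 15),
    "Mixed": (0, 20),
    "Identity preservation": (7, 15),
}

# Per-category accuracy in whole percent.
for name, (correct, total) in scores.items():
    print(f"{name}: {correct}/{total} ({round(100 * correct / total)}%)")

total_correct = sum(correct for correct, _ in scores.values())
total_examples = sum(total for _, total in scores.values())
print(f"Total: {total_correct}/{total_examples}")
```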

Leaderboard (100-example custom benchmark)

Model	Total	Spelling/30	Grammar/20	Punctuation/15	Mixed/20	Identity/15
sozkz-core-llama-600m-kk-gec-v1	47%	15	12	3	2	15/15
sozkz-fix-qwen-500m-kk-gec-v3	38%	0	16	9	0	13/15
sozkz-core-llama-300m-kk-gec-v4	37%	9	6	4	3	15/15
sozkz-fix-qwen-500m-kk-gec-v1	35%	0	12	8	0	15/15
sozkz-fix-qwen-500m-kk-gec-v2	30%	0	11	7	0	12/15
sozkz-core-llama-1b-kk-gec-v1	16%	2	6	1	0	7/15
sozkz-fix-qwen-500m-kk-gec-v4	5%	0	1	4	0	0/15
sozkz-fix-mt5b-kk-gec-run13-v1	5%	0	2	0	0	3/15
sozkz-nllb-1b-kk-gec-v1	1%	0	1	0	0	0/15
sozkz-nllb-1b-kk-pretrain-v1	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v3	1%	0	1	0	0	0/15
sozkz-core-llama-300m-kk-gec-v1/v2a/v2b	0–1%	0	0	0	0	0–1
sozkz-fix-mt5-50m-kk-gec-v1	0%	0	0	0	0	0/15
