---
language:
- kk
license: mit
tags:
- gec
- grammar-correction
- kazakh
- llama
datasets:
- stukenov/sozkz-corpus-synthetic-kk-gec-v1
base_model: stukenov/sozkz-core-llama-1b-kk-base-v1
pipeline_tag: text-generation
---

# sozkz-core-llama-1b-kk-gec-v1

A Kazakh grammatical error correction (GEC) model built on the Llama architecture (1.08B parameters).

## Model Details

- **Base model:** [stukenov/sozkz-core-llama-1b-kk-base-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-base-v1)
- **Method:** Full fine-tune (all 1.08B parameters)
- **Dataset:** [stukenov/sozkz-corpus-synthetic-kk-gec-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-synthetic-kk-gec-v1) (216K error pairs + 216K clean examples)
- **Training:** 3 epochs, LR=1e-5, cosine scheduler, bf16, BS=64, 1xH100 80GB, 57 min
- **Final loss:** 0.118

## Format

The model uses a simple two-line format: the input text goes on the first line, and the model generates the correction on the second line.

```
{input text}
{corrected text}
```
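
The two-line format can be wrapped in a pair of small helpers. This is a minimal sketch, not part of the released code: `build_prompt` and `extract_correction` are hypothetical names, and collapsing whitespace in the input is an assumption, since the card does not say how multi-line inputs were handled during training.

```python
def build_prompt(text: str) -> str:
    """Return the first line of the two-line format: the input plus a newline."""
    # Assumption: collapse runs of whitespace (including newlines) so the
    # input occupies exactly one line.
    single_line = " ".join(text.split())
    return single_line + "\n"


def extract_correction(completion: str) -> str:
    """Return the first non-empty line of the generated continuation."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


print(repr(build_prompt("Ол мектепке бардым.")))
```

Taking only the first non-empty line is a defensive choice in case generation continues past the correction before hitting the end-of-sequence token.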

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Older transformers releases expect torch_dtype= instead of dtype=.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")

# Make sure generation has EOS and padding tokens to work with.
if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def correct(text):
    # Two-line format: the input followed by a newline; the model writes the correction.
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc, max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding for deterministic output
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()

print(correct("Ол мектепке бардым."))
# Ол мектепке барды.
```

## Results

| Metric | Value |
|---|---|
| CER | 0.019 |
| Word Precision | 0.704 |
| Word Recall | 0.575 |
| Word F0.5 | 0.673 |
| Identity Preservation | 97.2% |

**Strengths:**
- Very high identity preservation (97.2%): clean text is rarely altered on this evaluation
- Good precision on the corrections it makes (70.4%)

**Limitations:**
- Conservative: prefers to leave text unchanged when unsure
- Moderate recall (57.5%): misses some errors
- Trained on synthetic data: may not cover all real-world error patterns
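
The card does not state how these metrics were computed. As a rough illustration, the sketch below uses the standard definitions: F-beta combines precision and recall with beta = 0.5 (weighting precision more heavily), and CER is the character-level Levenshtein distance divided by the reference length. The function names are hypothetical and the actual evaluation script may differ.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Row-by-row Levenshtein distance; prev[j] holds the distance between
    # the reference prefix seen so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


# Plugging in the rounded precision and recall from the table above.
print(round(f_beta(0.704, 0.575), 3))
```

With the rounded table values this gives about 0.674, which matches the reported Word F0.5 of 0.673 up to rounding of the inputs.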

## Training

| Parameter | Value |
|---|---|
| Architecture | Llama 1.08B |
| Method | Full fine-tune |
| Learning rate | 1e-5 |
| Epochs | 3 |
| Effective batch size | 128 |
| Max sequence length | 512 |
| Precision | bf16 |
| Clean ratio | 80% |
| Hardware | 1x NVIDIA H100 80GB |
| Training time | 57 minutes |

## License

MIT

## Benchmark Results

Evaluated on a **100-example custom GEC test** (pure model inference, no pre- or post-processing pipeline).

| Category | Score |
|----------|-------|
| Spelling (orthography) | 2/30 (7%) |
| Grammar | 6/20 (30%) |
| Punctuation | 1/15 (7%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 7/15 (47%) |
| **Total** | **16/100 (16%)** |
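
As a quick arithmetic check, the per-category counts above can be tallied into the total. This is only a sketch of the bookkeeping, with category names given in English; it is not the benchmark harness itself.

```python
# (correct, total) per category, copied from the table above.
scores = {
    "Spelling": (2, 30),
    "Grammar": (6, 20),
    "Punctuation": (1, 15),
    "Mixed": (0, 20),
    "Identity preservation": (7, 15),
}

correct = sum(c for c, _ in scores.values())
total = sum(n for _, n in scores.values())

for name, (c, n) in scores.items():
    print(f"{name}: {c}/{n} ({c / n:.0%})")
print(f"Total: {correct}/{total} ({correct / total:.0%})")
```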


## Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct./15 | Mixed/20 | Identity/15 |
|--------|-------|---------|----------|----------|---------|---------|
| **[sozkz-core-llama-600m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-gec-v1)** | **47%** | 15 | 12 | 3 | 2 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v3) | 38% | 0 | 16 | 9 | 0 | 13/15 |
| [sozkz-core-llama-300m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v4) | 37% | 9 | 6 | 4 | 3 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v1) | 35% | 0 | 12 | 8 | 0 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v2](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v2) | 30% | 0 | 11 | 7 | 0 | 12/15 |
| [sozkz-core-llama-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-gec-v1) | 16% | 2 | 6 | 1 | 0 | 7/15 |
| [sozkz-fix-qwen-500m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4) | 5% | 0 | 1 | 4 | 0 | 0/15 |
| [sozkz-fix-mt5b-kk-gec-run13-v1](https://huggingface.co/stukenov/sozkz-fix-mt5b-kk-gec-run13-v1) | 5% | 0 | 2 | 0 | 0 | 3/15 |
| [sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-nllb-1b-kk-pretrain-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-pretrain-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-core-llama-300m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v3) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1/15 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |