---
language:
- kk
license: mit
tags:
- gec
- grammar-correction
- kazakh
- llama
datasets:
- stukenov/sozkz-corpus-synthetic-kk-gec-v1
base_model: stukenov/sozkz-core-llama-1b-kk-base-v1
pipeline_tag: text-generation
---

# sozkz-core-llama-1b-kk-gec-v1

Kazakh grammatical error correction (GEC) model based on the Llama 1B architecture.

## Model Details

- **Base model:** [stukenov/sozkz-core-llama-1b-kk-base-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-base-v1)
- **Method:** Full fine-tune (all 1.08B parameters)
- **Dataset:** [stukenov/sozkz-corpus-synthetic-kk-gec-v1](https://huggingface.co/datasets/stukenov/sozkz-corpus-synthetic-kk-gec-v1) (216K error pairs + 216K clean examples)
- **Training:** 3 epochs, LR=1e-5, cosine scheduler, bf16, BS=64, 1x H100 80GB, 57 min
- **Final loss:** 0.118

## Format

Simple two-line format. The input goes on the first line, and the model generates the correction on the second line:

```
{input text}
{corrected text}
```

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-1b-kk-gec-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Note: `dtype` is the argument name in recent transformers releases;
# older releases may expect `torch_dtype` instead.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to("cuda")

# Make sure EOS and PAD tokens exist so generation can stop and pad cleanly.
if tokenizer.eos_token is None:
    tokenizer.add_special_tokens({"eos_token": "<|endoftext|>"})
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def correct(text):
    # The prompt is the input text followed by a newline (see Format above).
    prompt = text + "\n"
    enc = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        out = model.generate(
            **enc,
            max_new_tokens=256,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,  # greedy decoding for deterministic corrections
        )
    # Decode only the newly generated tokens, dropping the prompt.
    return tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True).strip()


print(correct("Ол мектепке бардым."))
# Ол мектепке барды.
```

## Results

| Metric | Value |
|---|---|
| CER | 0.019 |
| Word Precision | 0.704 |
| Word Recall | 0.575 |
| Word F0.5 | 0.673 |
| Identity Preservation | 97.2% |

**Strengths:**

- Very high identity preservation (97.2%): the model rarely corrupts clean text
- Good precision on the corrections it makes (70.4%)

**Limitations:**

- Conservative: prefers not to change text when unsure
- Recall is moderate (57.5%), so some errors are missed
- Trained on synthetic data, so it may not cover all real-world error patterns

## Training

| Parameter | Value |
|---|---|
| Architecture | Llama 1.08B |
| Method | Full fine-tune |
| Learning rate | 1e-5 |
| Epochs | 3 |
| Effective batch size | 128 |
| Max sequence length | 512 |
| Precision | bf16 |
| Clean ratio | 80% |
| Hardware | 1x NVIDIA H100 80GB |
| Training time | 57 minutes |

## License

MIT

## Benchmark Results

Evaluated on a **100-example custom GEC test** (pure model inference, no pre- or post-processing pipeline).

| Category | Score |
|----------|-------|
| Spelling (емле) | 2/30 (7%) |
| Grammar | 6/20 (30%) |
| Punctuation | 1/15 (7%) |
| Mixed | 0/20 (0%) |
| Identity preservation | 7/15 (47%) |
| **Total** | **16/100 (16%)** |

## Leaderboard (100-example custom benchmark)

| Model | Total | Spelling/30 | Grammar/20 | Punct./15 | Mixed/20 | Ident/15 |
|--------|-------|---------|----------|----------|---------|---------|
| **[sozkz-core-llama-600m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-gec-v1)** | **47%** | 15 | 12 | 3 | 2 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v3) | 38% | 0 | 16 | 9 | 0 | 13/15 |
| [sozkz-core-llama-300m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v4) | 37% | 9 | 6 | 4 | 3 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v1](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v1) | 35% | 0 | 12 | 8 | 0 | 15/15 |
| [sozkz-fix-qwen-500m-kk-gec-v2](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v2) | 30% | 0 | 11 | 7 | 0 | 12/15 |
| [sozkz-core-llama-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-core-llama-1b-kk-gec-v1) | 16% | 2 | 6 | 1 | 0 | 7/15 |
| [sozkz-fix-qwen-500m-kk-gec-v4](https://huggingface.co/stukenov/sozkz-fix-qwen-500m-kk-gec-v4) | 5% | 0 | 1 | 4 | 0 | 0/15 |
| [sozkz-fix-mt5b-kk-gec-run13-v1](https://huggingface.co/stukenov/sozkz-fix-mt5b-kk-gec-run13-v1) | 5% | 0 | 2 | 0 | 0 | 3/15 |
| [sozkz-nllb-1b-kk-gec-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-gec-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-nllb-1b-kk-pretrain-v1](https://huggingface.co/stukenov/sozkz-nllb-1b-kk-pretrain-v1) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| [sozkz-core-llama-300m-kk-gec-v3](https://huggingface.co/stukenov/sozkz-core-llama-300m-kk-gec-v3) | 1% | 0 | 1 | 0 | 0 | 0/15 |
| sozkz-core-llama-300m-kk-gec-v1/v2a/v2b | 0–1% | 0 | 0 | 0 | 0 | 0–1 |
| sozkz-fix-mt5-50m-kk-gec-v1 | 0% | 0 | 0 | 0 | 0 | 0/15 |
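
The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard F-beta definition for Word F0.5 (this card does not state the exact formula used by the evaluation script) and re-derives this model's benchmark total from its per-category scores. The helper `f_beta` and the dictionary keys are illustrative names, not part of any released evaluation code.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Word precision and recall copied from the Results table.
reported_f05 = 0.673
computed_f05 = f_beta(0.704, 0.575)
print(f"computed F0.5 = {computed_f05:.3f}, reported = {reported_f05}")

# Per-category (correct, total) counts for this model on the custom benchmark.
benchmark = {
    "spelling": (2, 30),
    "grammar": (6, 20),
    "punctuation": (1, 15),
    "mixed": (0, 20),
    "identity": (7, 15),
}
total_correct = sum(correct for correct, _ in benchmark.values())
total_examples = sum(total for _, total in benchmark.values())
print(f"benchmark total = {total_correct}/{total_examples}")
```

Because the precision and recall inputs are already rounded to three decimals, a difference in the last digit between the computed and reported F0.5 is expected.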