---
license: mit
language:
- kk
tags:
- sentiment-analysis
- kazakh
- classification
- llama
base_model: stukenov/sozkz-core-llama-600m-kk-base-v1
datasets:
- issai/kazsandra
pipeline_tag: text-generation
---

# SozKZ Core Llama 600M — Kazakh Sentiment (v1)

Binary sentiment classifier for Kazakh text, fine-tuned from [sozkz-core-llama-600m-kk-base-v1](https://huggingface.co/stukenov/sozkz-core-llama-600m-kk-base-v1).

## Usage

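Wrap the input text in `<sentiment>` and `</sentiment>` tags followed by a newline; the model then generates the label (positive or negative) as the continuation: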

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "stukenov/sozkz-core-llama-600m-kk-sentiment-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

text = "Тамақтары өте дәмді, қызмет көрсету керемет!"
prompt = f"<sentiment>{text}</sentiment>\n"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

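with torch.no_grad():
    # Greedy decoding; the label is only a few tokens long.
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Keep only the newly generated tokens (drop the prompt tokens).
generated = output[0][inputs["input_ids"].shape[1]:]
label = tokenizer.decode(generated, skip_special_tokens=True).strip()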
print(label)  # "positive"
```
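
Because the label comes back as free-form generated text, it can help to normalize it before using it downstream. The following is a minimal sketch, not part of the model's API: `parse_label` and `LABELS` are hypothetical names, and the label strings are assumed to be `positive` and `negative` based on the example output above and the binary setup described under Training.

```python
from typing import Optional

# Assumed label strings; "positive" appears in the usage example,
# "negative" is inferred from the binary (positive/negative) setup.
LABELS = ("positive", "negative")


def parse_label(generated_text: str) -> Optional[str]:
    """Map raw generated text to one of LABELS, or None if no label is recognized."""
    cleaned = generated_text.strip().lower()
    for label in LABELS:
        # Accept trailing tokens after the label, since up to 5 new tokens are generated.
        if cleaned.startswith(label):
            return label
    return None
```

Returning `None` for unrecognized output lets the caller decide how to handle generations that do not start with a known label, instead of silently treating them as one class.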

## Training

| Parameter | Value |
|---|---|
| Base model | sozkz-core-llama-600m-kk-base-v1 (587M params) |
| Dataset | issai/kazsandra (KazSAnDRA) → binary (positive/negative) |
| Train samples | 57,312 (balanced) |
| Val samples | 3,016 |
| Epochs | 3 |
| Batch size | 64 (8 × 4 GPU × 2 accum) |
| Learning rate | 2e-5 (cosine) |
| Final loss | ~0.10 |
| Hardware | 4× RTX 4090 |
| Training time | ~1.9h |
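
The batch-size and schedule figures in the table can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: the last partial batch of each epoch is assumed to be kept, and the cosine schedule is assumed to decay from the peak rate to zero with no warmup, neither of which the table specifies. All variable names are illustrative.

```python
import math

per_device_batch = 8
num_gpus = 4
grad_accum_steps = 2
train_samples = 57_312
epochs = 3
peak_lr = 2e-5

# Effective batch size: 8 per device x 4 GPUs x 2 accumulation steps = 64.
effective_batch = per_device_batch * num_gpus * grad_accum_steps

# Optimizer steps per epoch, assuming the last partial batch is kept.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs


def cosine_lr(step: int, total: int = total_steps, peak: float = peak_lr) -> float:
    """Cosine decay from peak to 0 (no warmup and a zero floor are assumptions)."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the run takes 896 optimizer steps per epoch and 2,688 in total, and the learning rate passes through 1e-5 at the midpoint of training.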

## Dataset

Based on [issai/kazsandra](https://huggingface.co/datasets/issai/kazsandra) (LREC 2024).
Scores of 1-2 are mapped to **negative** and 4-5 to **positive**; score 3 (neutral) is excluded.
Classes are balanced by undersampling the majority class.
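
The relabeling and balancing described above could be sketched roughly as follows. This is an illustration only, not the actual preprocessing script: `score_to_label` and `build_balanced_binary` are hypothetical helpers, the input is assumed to be a list of `(text, score)` pairs, and the random seed is arbitrary.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def score_to_label(score: int) -> Optional[str]:
    """Map a 1-5 score to a binary label; 3 (neutral) or any other value returns None."""
    if score in (1, 2):
        return "negative"
    if score in (4, 5):
        return "positive"
    return None


def build_balanced_binary(
    rows: List[Tuple[str, int]], seed: int = 0
) -> List[Tuple[str, str]]:
    """Relabel (text, score) rows and undersample the larger class to the smaller one's size."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for text, score in rows:
        label = score_to_label(score)
        if label is not None:  # drop neutral rows
            by_label[label].append(text)
    if not by_label:
        return []
    # Size of the smallest class present in the data.
    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, texts in by_label.items():
        for text in rng.sample(texts, target):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return balanced
```

Undersampling discards majority-class examples rather than duplicating minority ones, which keeps every training example unique at the cost of a smaller training set.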

## Results

In a manual spot check, the model classified 10 of 10 test examples correctly, covering positive, negative, and ambiguous inputs.

## License

MIT