---
library_name: transformers
license: apache-2.0
base_model: mathiaskabango/shona-mt5-small
language:
- sn
tags:
- shona
- african-languages
- low-resource-nlp
- conversational-ai
- chatbot
- mt5
- zimbabwe
- generated_from_trainer
model-index:
- name: taurabot-shona
  results: []
---

# TauraBot — Shona Conversational AI

> **"Taura" means "Speak" in Shona (chiShona)**

TauraBot is the first open-source conversational AI model built specifically
for Shona speakers. It is a fine-tuned version of
[mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small)
— itself a continued pre-training of Google's mT5-small on a Shona text corpus.

Shona is spoken by approximately 15 million people, primarily in Zimbabwe,
yet remains almost entirely absent from modern NLP research and tooling.
TauraBot is a step toward changing that.

---

## ⚠️ Important — Please Read Before Using

> **This model is an early-stage research release and is not yet production-ready.**

Due to significant GPU constraints during training, this model was fine-tuned
on a limited dataset with restricted compute. As a result:

- Responses may be **inconsistent or grammatically imperfect**
- The model may **repeat phrases** or produce generic outputs
- It performs best on **simple conversational exchanges** similar to its
  training data
- It will **not** handle complex or domain-specific Shona well yet

**If you want to use this model in a real application, we strongly recommend
further fine-tuning on your own Shona conversational data.** See the
fine-tuning guide below.

This model is actively being improved. A better version with more training
data and compute is planned for release. Watch this repo for updates.

---

## Model Details

| Property | Details |
|---|---|
| **Base Model** | [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small) |
| **Model Type** | Seq2Seq Conversational (Text-to-Text) |
| **Language** | Shona (`sn`) |
| **License** | Apache 2.0 |
| **Developer** | Mathias Kabango — African Leadership University, Kigali, Rwanda |
| **Training Data** | 500 curated Shona conversation pairs |
| **Task Prefix** | `taura:` |

---

## How to Use

The model requires a `taura:` prefix on all inputs. Without this prefix
it will not behave conversationally.

### Basic inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/taurabot-shona")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/taurabot-shona")

def chat(message):
    # Always include the task prefix
    input_text = "taura: " + message.strip()
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example conversations (outputs are illustrative; actual responses may vary)
print(chat("Mhoro, makadii?"))
# Example output: "Ndiripo mazvita, imi makadii?"

print(chat("Zita rako ndiani?"))
# Example output: "Zita rangu ndiTauraBot."

print(chat("Unoda kudya chii?"))
# Example output: "Ndinoda sadza nemufushwa."
```

### Simple chat loop

```python
print("TauraBot  — Taura neni! (type 'exit' to quit)\n")
while True:
    user = input("Iwe:      ")
    if user.strip().lower() == "exit":
        break
    print(f"TauraBot: {chat(user)}\n")
```

---

## How to Fine-Tune Further (Recommended)

Because this model was trained under compute constraints, **further fine-tuning
on your own data is the most likely way to improve quality.** Here is a minimal
script to continue training from this checkpoint:

```python
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainer, Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import Dataset

MODEL = "mathiaskabango/taurabot-shona"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Your conversation pairs — the more the better
# Format: input is the human turn, target is the bot response
my_conversations = [
    {"input": "taura: Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura: Ndiri kuneta.", "target": "Zorora zvishoma. Unokwanisa!"},
    # add as many as you have — 1000+ pairs recommended
]

dataset = Dataset.from_list(my_conversations)

def preprocess(batch):
    inputs = tokenizer(batch["input"],  max_length=64,
                       truncation=True, padding="max_length")
    labels = tokenizer(batch["target"], max_length=64,
                       truncation=True, padding="max_length")
    labels["input_ids"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in label]
        for label in labels["input_ids"]
    ]
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="taurabot-finetuned",
    num_train_epochs=20,              # small datasets overfit quickly; watch validation loss
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size = 4 x 4 = 16
    learning_rate=1e-4,               # lower LR when continuing from checkpoint
    warmup_steps=50,
    predict_with_generate=True,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                        # requires a CUDA GPU; set False on CPU
    push_to_hub=False,                # set True to push to your own HF repo
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

trainer.train()

# Save your improved model
model.save_pretrained("taurabot-finetuned")
tokenizer.save_pretrained("taurabot-finetuned")
print("Done! Test your improved model.")
```

### Tips for better fine-tuning results

- **More data is the single biggest improvement** — aim for 1,000 to 5,000
  conversation pairs
- Use **native speaker corrections** if possible
- Keep conversations **short and natural** — 1 to 2 sentences per turn
- Always use the `taura:` prefix in your input column
- A lower learning rate (`1e-4` or `5e-5`, compared with the `3e-4` used for
  this release) reduces the risk of overwriting what the model already knows
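
The prefix and turn-length tips above can be checked mechanically before training. The sketch below is one possible approach and was not part of the original training pipeline; the helper name `prepare_pairs` and the 200-character cut-off are assumptions chosen for illustration.

```python
PREFIX = "taura:"  # task prefix required by the model

def prepare_pairs(pairs, max_chars=200):
    """Normalise conversation pairs before fine-tuning.

    Hypothetical helper: adds the task prefix when missing, trims
    whitespace, and drops empty or overly long turns. `max_chars` is an
    assumed cut-off, not a value used by the original training run.
    """
    cleaned = []
    for pair in pairs:
        text = pair.get("input", "").strip()
        target = pair.get("target", "").strip()
        # Remove an existing prefix so it is never duplicated
        if text.lower().startswith(PREFIX):
            text = text[len(PREFIX):].strip()
        if not text or not target:
            continue  # skip incomplete pairs
        if len(text) > max_chars or len(target) > max_chars:
            continue  # keep turns short
        cleaned.append({"input": f"{PREFIX} {text}", "target": target})
    return cleaned

raw = [
    {"input": "Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura:  Ndiri kuneta. ", "target": "Zorora zvishoma."},
    {"input": "", "target": "Mhoro!"},  # dropped: empty input
]
pairs = prepare_pairs(raw)
```

The returned list has the same `input` and `target` keys as the fine-tuning script, so it can be passed directly to `Dataset.from_list`.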

---

## ⚠️ Limitations

| Limitation | Detail |
|---|---|
| **Compute constraints** | Trained on a single consumer GPU with limited VRAM. Only 18 of the 200 planned epochs were completed, and validation loss was already rising from epoch 9 onward. |
| **Small training set** | Fine-tuned on 500 conversation pairs — significantly below the recommended minimum for production conversational models |
| **Early overfitting** | Validation loss stopped improving after epoch 8 (2.33) and began rising — a sign the model needs more diverse training data |
| **Hallucinated prefixes** | May occasionally output "Mubvunzo:" or similar artefacts inherited from pre-training data |
| **Limited domain coverage** | Trained primarily on everyday conversational Shona — will not handle medical, legal, or technical topics |
| **Dialect coverage** | Covers standard Shona as spoken in Zimbabwe — may not generalise to regional dialects |
| **Not for high-stakes use** | Should not be used for medical advice, legal decisions, or any critical application without significant further development |
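
The hallucinated-prefix artefact listed above can be handled with a small post-processing step on the decoded output. This is a minimal sketch under assumptions: `strip_artefacts` is a hypothetical helper, and only `"Mubvunzo:"` is taken from this card, so any other prefixes would need to be added from your own observations.

```python
import re

# "Mubvunzo:" comes from the limitations table; extend the tuple with
# any other artefacts you observe in your own outputs.
ARTEFACT_PREFIXES = ("Mubvunzo:",)

def strip_artefacts(text, prefixes=ARTEFACT_PREFIXES):
    """Remove known artefact prefixes from the start of a response."""
    pattern = r"^\s*(?:" + "|".join(re.escape(p) for p in prefixes) + r")\s*"
    previous = None
    # Repeat until stable in case the prefix appears more than once
    while previous != text:
        previous = text
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return text.strip()

cleaned = strip_artefacts("Mubvunzo: Ndiripo mazvita, imi makadii?")
```

Applying the helper to the string returned by `chat()` leaves responses without the artefact unchanged.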

---

## Training Details

### What the loss curve tells us

- Epoch 8: validation loss 2.33 (best checkpoint)
- Epoch 9: validation loss 2.35 (started rising, indicating overfitting)
- Epoch 18: validation loss 2.58 (continued rising)

The model began overfitting after epoch 8 because 500 conversation pairs
is a small dataset for a seq2seq model. The best weights are from around
epoch 8. More diverse training data would push the validation loss lower
before overfitting begins.
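
The reasoning above can be reproduced from the reported numbers. The sketch below picks the epoch with the lowest validation loss and shows a simple patience-based stopping rule; the function names and the `patience=3` default are assumptions for illustration, not the logic used in the original run.

```python
# Validation losses reported in the training results table (epoch -> loss)
val_loss = {
    2: 4.9347, 3: 3.3783, 4: 2.8899, 5: 2.5140,
    6: 2.4059, 7: 2.3723, 8: 2.3340, 9: 2.3476, 18: 2.5784,
}

def best_epoch(losses):
    """Return the epoch with the lowest validation loss."""
    return min(losses, key=losses.get)

def should_stop(history, patience=3):
    """Patience-based early stopping on a list of per-epoch losses.

    `patience` is an assumed value for illustration: stop once the loss
    has not improved on its best value for `patience` consecutive epochs.
    """
    best_index = history.index(min(history))
    return len(history) - 1 - best_index >= patience

best = best_epoch(val_loss)
gap = val_loss[18] - val_loss[best]
print(f"best epoch: {best}, loss rise by epoch 18: {gap:.4f}")
```

In a Trainer-based run, `EarlyStoppingCallback` from Transformers offers similar behaviour when an evaluation set and `load_best_model_at_end=True` are configured.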

### Training hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Train Batch Size | 8 |
| Gradient Accumulation | 2 (effective batch = 16) |
| Warmup Ratio | 0.03 |
| Epochs | 18 (of 200 planned) |
| Mixed Precision | fp16 |
| Optimizer | AdamW (fused) |
| Seed | 42 |

### Training results

| Epoch | Step | Training Loss | Validation Loss |
|:-----:|:----:|:-------------:|:---------------:|
| 2     | 80   | 11.7061       | 4.9347          |
| 3     | 120  | 6.0485        | 3.3783          |
| 4     | 160  | 3.8857        | 2.8899          |
| 5     | 200  | 2.9967        | 2.5140          |
| 6     | 240  | 2.5225        | 2.4059          |
| 7     | 280  | 2.2792        | 2.3723          |
| **8** | **320** | **2.0710**  | **2.3340** (best) |
| 9     | 360  | 1.9104        | 2.3476          |
| 18    | 720  | 1.2119        | 2.5784          |

### Framework versions

| Library | Version |
|---|---|
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| Datasets | 2.21.0 |
| Tokenizers | 0.22.2 |

---

## Roadmap

- [ ] **TauraBot v2** — retrain base model with more steps and larger corpus
- [ ] **Larger conversation dataset** — expanding beyond 500 pairs
- [ ] **Shona corpus public release** — `mathiaskabango/shona-corpus`
- [ ] **Gradio demo space** — interactive TauraBot demo
- [ ] **Shona Whisper** — speech recognition for Shona

---

## Contact

- **Developer:** Mathias Kabango
- **Institution:** African Leadership University, Kigali, Rwanda
- **Email:** kabangomathias0@gmail.com
- **GitHub:** [Mathias-Kabango3](https://github.com/Mathias-Kabango3)
- **Base model:** [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small)

If you fine-tune this model and get good results, please open a discussion
on this repo and share what worked — it will help everyone building Shona
NLP tools.

---

## Acknowledgements

Built as part of a mission to create open-source AI infrastructure for
African languages. If you are working on Shona, Ndebele, or related Bantu
languages and want to collaborate, please reach out.

---

*Built with ❤️*