---
library_name: transformers
license: apache-2.0
base_model: mathiaskabango/shona-mt5-small
language:
- sn
tags:
- shona
- african-languages
- low-resource-nlp
- conversational-ai
- chatbot
- mt5
- zimbabwe
- generated_from_trainer
model-index:
- name: taurabot-shona
  results: []
---

# TauraBot: Shona Conversational AI

> **"Taura" means "Speak" in Shona (chiShona)**

TauraBot is the first open-source conversational AI model built specifically for Shona speakers. It is a fine-tuned version of [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small), itself a continued pre-training of Google's mT5-small on a Shona text corpus.

Shona is spoken by approximately 15 million people, primarily in Zimbabwe, yet remains almost entirely absent from modern NLP research and tooling. TauraBot is a step toward changing that.

---

## ⚠️ Important: Please Read Before Using

> **This model is an early-stage research release and not yet production-ready.**

Due to significant GPU constraints during training, this model was fine-tuned on a limited dataset with restricted compute. As a result:

- Responses may be **inconsistent or grammatically imperfect**
- The model may **repeat phrases** or produce generic outputs
- It performs best on **simple conversational exchanges** similar to its training data
- It will **not** handle complex or domain-specific Shona well yet

**If you want to use this model in a real application, we strongly recommend further fine-tuning on your own Shona conversational data.** See the fine-tuning guide below.

This model is actively being improved. A better version with more training data and compute is planned for release. Watch this repo for updates.


---

## Model Details

| Property | Details |
|---|---|
| **Base Model** | [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small) |
| **Model Type** | Seq2Seq Conversational (Text-to-Text) |
| **Language** | Shona (`sn`) |
| **License** | Apache 2.0 |
| **Developer** | Mathias Kabango, African Leadership University, Kigali, Rwanda |
| **Training Data** | 500 curated Shona conversation pairs |
| **Task Prefix** | `taura:` |

---

## How to Use

The model requires a `taura:` prefix on all inputs. Without this prefix it will not behave conversationally.

### Basic inference

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/taurabot-shona")
model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/taurabot-shona")

def chat(message):
    # Always include the task prefix
    input_text = "taura: " + message.strip()
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=64,
        truncation=True,
    )
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=2.0,
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example conversations
print(chat("Mhoro, makadii?"))    # Expected: "Ndiripo mazvita, imi makadii?"
print(chat("Zita rako ndiani?"))  # Expected: "Zita rangu ndiTauraBot."
print(chat("Unoda kudya chii?"))  # Expected: "Ndinoda sadza nemufushwa."
```

### Simple chat loop

```python
# Reuses the chat() function defined in the basic inference example
print("TauraBot - Taura neni! (type 'exit' to quit)\n")
while True:
    user = input("Iwe: ")
    if user.strip().lower() == "exit":
        break
    print(f"TauraBot: {chat(user)}\n")
```

---

## How to Fine-Tune Further (Recommended)

Because this model was trained under compute constraints, **further fine-tuning on your own data will significantly improve quality.** Here is a minimal script to continue training:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq,
)
from datasets import Dataset

MODEL = "mathiaskabango/taurabot-shona"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Your conversation pairs: the more the better
# Format: input is the human turn, target is the bot response
my_conversations = [
    {"input": "taura: Mhoro!", "target": "Mhoro! Makadii?"},
    {"input": "taura: Ndiri kuneta.", "target": "Zorora zvishoma. Unokwanisa!"},
    # add as many as you have, 1000+ pairs recommended
]
dataset = Dataset.from_list(my_conversations)

def preprocess(batch):
    inputs = tokenizer(batch["input"], max_length=64, truncation=True, padding="max_length")
    labels = tokenizer(batch["target"], max_length=64, truncation=True, padding="max_length")
    # Replace padding token ids with -100 so the loss ignores them
    labels["input_ids"] = [
        [(t if t != tokenizer.pad_token_id else -100) for t in label]
        for label in labels["input_ids"]
    ]
    inputs["labels"] = labels["input_ids"]
    return inputs

# Drop the raw text columns so only model inputs remain
tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="taurabot-finetuned",
    num_train_epochs=20,            # increase for better results
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,             # lower LR when continuing from checkpoint
    warmup_steps=50,
    predict_with_generate=True,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                      # needs a CUDA GPU; set False when training on CPU
    push_to_hub=False,              # set True to push to your own HF repo
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

# Save your improved model
model.save_pretrained("taurabot-finetuned")
tokenizer.save_pretrained("taurabot-finetuned")
print("Done! Test your improved model.")
```

### Tips for better fine-tuning results

- **More data is the single biggest improvement**: aim for 1,000 to 5,000 conversation pairs
- Use **native speaker corrections** if possible
- Keep conversations **short and natural**, 1 to 2 sentences per turn
- Always use the `taura:` prefix in your input column
- A lower learning rate (`1e-4` or `5e-5`) prevents overwriting what the model already knows

---

## ⚠️ Limitations

| Limitation | Detail |
|---|---|
| **Compute constraints** | Trained on a single consumer GPU with limited VRAM. Only 18 of the 200 planned epochs were completed, and validation loss had been rising since epoch 9. |
| **Small training set** | Fine-tuned on 500 conversation pairs, significantly below the recommended minimum for production conversational models |
| **Early overfitting** | Validation loss stopped improving after epoch 8 (2.33) and began rising, a sign the model needs more diverse training data |
| **Hallucinated prefixes** | May occasionally output "Mubvunzo:" or similar artefacts inherited from pre-training data |
| **Limited domain coverage** | Trained primarily on everyday conversational Shona; will not handle medical, legal, or technical topics |
| **Dialect coverage** | Covers standard Shona as spoken in Zimbabwe; may not generalise to regional dialects |
| **Not for high-stakes use** | Should not be used for medical advice, legal decisions, or any critical application without significant further development |

---

## Training Details

### What the loss curve tells us

- Epoch 8: validation loss 2.33 ← best checkpoint
- Epoch 9: validation loss 2.35 ← started rising (overfitting)
- Epoch 18: validation loss 2.58 ← continued rising

The model began overfitting after epoch 8 because 500 conversation pairs is a small dataset for a seq2seq model. The best weights are from around epoch 8.
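
As a minimal sketch, the best checkpoint can also be read off the reported numbers programmatically. The values are copied from the Training results table in this card (epochs missing from that table are simply absent here), and `best_epoch` and `rise_since_best` are hypothetical helper names for illustration, not part of any library.

```python
# Validation losses copied from the "Training results" table of this card.
# Only the epochs reported there are included.
val_loss = {
    2: 4.9347, 3: 3.3783, 4: 2.8899, 5: 2.5140,
    6: 2.4059, 7: 2.3723, 8: 2.3340, 9: 2.3476, 18: 2.5784,
}

def best_epoch(losses):
    """Return (epoch, loss) for the lowest validation loss (hypothetical helper)."""
    epoch = min(losses, key=losses.get)
    return epoch, losses[epoch]

def rise_since_best(losses, epoch):
    """How far the validation loss at `epoch` sits above the best reported value."""
    _, best = best_epoch(losses)
    return losses[epoch] - best

epoch, loss = best_epoch(val_loss)
print(f"best checkpoint: epoch {epoch}, validation loss {loss:.4f}")
print(f"rise by epoch 18: {rise_since_best(val_loss, 18):+.4f}")
```

On the reported values this selects epoch 8, with validation loss roughly 0.24 higher by epoch 18. When fine-tuning with a held-out evaluation set, the `load_best_model_at_end=True` training argument (with matching evaluation and save strategies) is one way to automate the same selection.
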
More diverse training data would push the validation loss lower before overfitting begins.

### Training hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 3e-4 |
| Train Batch Size | 8 |
| Gradient Accumulation | 2 (effective batch = 16) |
| Warmup Ratio | 0.03 |
| Epochs | 18 (of 200 planned) |
| Mixed Precision | fp16 |
| Optimizer | AdamW (fused) |
| Seed | 42 |

### Training results

| Epoch | Step | Training Loss | Validation Loss |
|:-----:|:----:|:-------------:|:---------------:|
| 2 | 80 | 11.7061 | 4.9347 |
| 3 | 120 | 6.0485 | 3.3783 |
| 4 | 160 | 3.8857 | 2.8899 |
| 5 | 200 | 2.9967 | 2.5140 |
| 6 | 240 | 2.5225 | 2.4059 |
| 7 | 280 | 2.2792 | 2.3723 |
| **8** | **320** | **2.0710** | **2.3340** ← best |
| 9 | 360 | 1.9104 | 2.3476 |
| 18 | 720 | 1.2119 | 2.5784 |

### Framework versions

| Library | Version |
|---|---|
| Transformers | 4.57.6 |
| PyTorch | 2.10.0+cu128 |
| Datasets | 2.21.0 |
| Tokenizers | 0.22.2 |

---

## Roadmap

- [ ] **TauraBot v2**: retrain base model with more steps and larger corpus
- [ ] **Larger conversation dataset**: expanding beyond 500 pairs
- [ ] **Shona corpus public release**: `mathiaskabango/shona-corpus`
- [ ] **Gradio demo space**: interactive TauraBot demo
- [ ] **Shona Whisper**: speech recognition for Shona

---

## Contact

- **Developer:** Mathias Kabango
- **Institution:** African Leadership University, Kigali, Rwanda
- **Email:** kabangomathias0@gmail.com
- **GitHub:** [Mathias-Kabango3](https://github.com/Mathias-Kabango3)
- **Base model:** [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small)

If you fine-tune this model and get good results, please open a discussion on this repo and share what worked; it will help everyone building Shona NLP tools.

---

## Acknowledgements

Built as part of a mission to create open-source AI infrastructure for African languages. If you are working on Shona, Ndebele, or related Bantu languages and want to collaborate, please reach out.


---

*Built with ❤️*