mathiaskabango committed · Commit 2993e44 · verified · 1 Parent(s): 5f45a3e

Update README.md

  1. README.md +279 -50
README.md CHANGED
@@ -2,78 +2,307 @@
2
  library_name: transformers
3
  license: apache-2.0
4
  base_model: mathiaskabango/shona-mt5-small
 
 
5
  tags:
 
 
 
 
 
 
 
6
  - generated_from_trainer
7
  model-index:
8
  - name: taurabot-shona
9
  results: []
10
  ---
11
 
12
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
13
- should probably proofread and complete it, then remove this comment. -->
14
 
15
- # taurabot-shona
16
 
17
- This model is a fine-tuned version of [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small) on the None dataset.
18
- It achieves the following results on the evaluation set:
19
- - Loss: 2.5784
 
20
 
21
- ## Model description
 
 
22
 
23
- More information needed
24
 
25
- ## Intended uses & limitations
26
 
27
- More information needed
28
 
29
- ## Training and evaluation data
30
 
31
- More information needed
32
 
33
- ## Training procedure
34
 
35
  ### Training hyperparameters
36
 
37
- The following hyperparameters were used during training:
38
- - learning_rate: 0.0003
39
- - train_batch_size: 8
40
- - eval_batch_size: 8
41
- - seed: 42
42
- - gradient_accumulation_steps: 2
43
- - total_train_batch_size: 16
44
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
45
- - lr_scheduler_type: linear
46
- - lr_scheduler_warmup_ratio: 0.03
47
- - num_epochs: 200
48
- - mixed_precision_training: Native AMP
49
 
50
  ### Training results
51
 
52
- | Training Loss | Epoch | Step | Validation Loss |
53
- |:-------------:|:-----:|:----:|:---------------:|
54
- | No log | 1.0 | 40 | 7.2146 |
55
- | 11.7061 | 2.0 | 80 | 4.9347 |
56
- | 6.0485 | 3.0 | 120 | 3.3783 |
57
- | 3.8857 | 4.0 | 160 | 2.8899 |
58
- | 2.9967 | 5.0 | 200 | 2.5140 |
59
- | 2.9967 | 6.0 | 240 | 2.4059 |
60
- | 2.5225 | 7.0 | 280 | 2.3723 |
61
- | 2.2792 | 8.0 | 320 | 2.3340 |
62
- | 2.071 | 9.0 | 360 | 2.3476 |
63
- | 1.9104 | 10.0 | 400 | 2.3342 |
64
- | 1.9104 | 11.0 | 440 | 2.3480 |
65
- | 1.7523 | 12.0 | 480 | 2.3748 |
66
- | 1.6213 | 13.0 | 520 | 2.3932 |
67
- | 1.5312 | 14.0 | 560 | 2.4189 |
68
- | 1.4241 | 15.0 | 600 | 2.4628 |
69
- | 1.4241 | 16.0 | 640 | 2.4841 |
70
- | 1.3007 | 17.0 | 680 | 2.5431 |
71
- | 1.2119 | 18.0 | 720 | 2.5784 |
72
-
73
 
74
  ### Framework versions
75
 
76
- - Transformers 4.57.6
77
- - Pytorch 2.10.0+cu128
78
- - Datasets 2.21.0
79
- - Tokenizers 0.22.2
2
  library_name: transformers
3
  license: apache-2.0
4
  base_model: mathiaskabango/shona-mt5-small
5
+ language:
6
+ - sn
7
  tags:
8
+ - shona
9
+ - african-languages
10
+ - low-resource-nlp
11
+ - conversational-ai
12
+ - chatbot
13
+ - mt5
14
+ - zimbabwe
15
  - generated_from_trainer
16
  model-index:
17
  - name: taurabot-shona
18
  results: []
19
  ---
20
 
21
+ # TauraBot: Shona Conversational AI
 
22
 
23
+ > **"Taura" means "Speak" in Shona (chiShona)**
24
 
25
+ TauraBot is, to our knowledge, the first open-source conversational AI model built
26
+ specifically for Shona speakers. It is a fine-tuned version of
27
+ [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small),
28
+ which is itself a continued pre-training of Google's mT5-small on a Shona text corpus.
29
 
30
+ Shona is spoken by approximately 15 million people, primarily in Zimbabwe,
31
+ yet remains almost entirely absent from modern NLP research and tooling.
32
+ TauraBot is a step toward changing that.
33
 
34
+ ---
35
+
36
+ ## ⚠️ Important: Please Read Before Using
37
+
38
+ > **This model is an early-stage research release and is not yet production-ready.**
39
+
40
+ Due to significant GPU constraints during training, this model was fine-tuned
41
+ on a limited dataset with restricted compute. As a result:
42
+
43
+ - Responses may be **inconsistent or grammatically imperfect**
44
+ - The model may **repeat phrases** or produce generic outputs
45
+ - It performs best on **simple conversational exchanges** similar to its
46
+ training data
47
+ - It will **not** handle complex or domain-specific Shona well yet
48
+
49
+ **If you want to use this model in a real application, we strongly recommend
50
+ further fine-tuning on your own Shona conversational data.** See the
51
+ fine-tuning guide below.
52
+
53
+ This model is actively being improved. A better version with more training
54
+ data and compute is planned for release. Watch this repo for updates.
55
+
56
+ ---
57
+
58
+ ## Model Details
59
+
60
+ | Property | Details |
61
+ |---|---|
62
+ | **Base Model** | [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small) |
63
+ | **Model Type** | Seq2Seq Conversational (Text-to-Text) |
64
+ | **Language** | Shona (`sn`) |
65
+ | **License** | Apache 2.0 |
66
+ | **Developer** | Mathias Kabango, African Leadership University, Kigali, Rwanda |
67
+ | **Training Data** | 500 curated Shona conversation pairs |
68
+ | **Task Prefix** | `taura:` |
69
+
70
+ ---
71
+
72
+ ## How to Use
73
+
74
+ The model requires a `taura:` prefix on all inputs. Without this prefix
75
+ it will not behave conversationally.
76
+
77
+ ### Basic inference
78
+
79
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("mathiaskabango/taurabot-shona")
+ model = AutoModelForSeq2SeqLM.from_pretrained("mathiaskabango/taurabot-shona")
+
+ def chat(message):
+     # Always include the task prefix
+     input_text = "taura: " + message.strip()
+     inputs = tokenizer(
+         input_text,
+         return_tensors="pt",
+         max_length=64,
+         truncation=True,
+     )
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=60,
+         num_beams=4,
+         no_repeat_ngram_size=3,
+         repetition_penalty=2.0,
+         early_stopping=True,
+     )
+     return tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Example conversations
+ print(chat("Mhoro, makadii?"))
+ # Expected: "Ndiripo mazvita, imi makadii?"
+
+ print(chat("Zita rako ndiani?"))
+ # Expected: "Zita rangu ndiTauraBot."
+
+ print(chat("Unoda kudya chii?"))
+ # Expected: "Ndinoda sadza nemufushwa."
+ ```
114
+
115
+ ### Simple chat loop
116
+
117
+ ```python
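+ # Uses the chat() function defined in the basic inference example
+ print("TauraBot - Taura neni! (type 'exit' to quit)\n")
+ while True:
+     user = input("Iwe: ")
+     # Ignore surrounding whitespace and case when checking for the exit command
+     if user.strip().lower() == "exit":
+         break
+     print(f"TauraBot: {chat(user)}\n")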
+ ```
125
+
126
+ ---
127
+
128
+ ## How to Fine-Tune Further (Recommended)
129
+
130
+ Because this model was trained under compute constraints, **further fine-tuning
131
+ on your own data will significantly improve quality.** Here is a minimal
132
+ script to continue training:
133
+
134
+ ```python
+ from transformers import (
+     AutoTokenizer, AutoModelForSeq2SeqLM,
+     Seq2SeqTrainer, Seq2SeqTrainingArguments,
+     DataCollatorForSeq2Seq,
+ )
+ from datasets import Dataset
+
+ MODEL = "mathiaskabango/taurabot-shona"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+ model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)
+
+ # Your conversation pairs: the more the better
+ # Format: input is the human turn, target is the bot response
+ my_conversations = [
+     {"input": "taura: Mhoro!", "target": "Mhoro! Makadii?"},
+     {"input": "taura: Ndiri kuneta.", "target": "Zorora zvishoma. Unokwanisa!"},
+     # add as many as you have (1000+ pairs recommended)
+ ]

+ dataset = Dataset.from_list(my_conversations)

+ def preprocess(batch):
+     inputs = tokenizer(batch["input"], max_length=64,
+                        truncation=True, padding="max_length")
+     labels = tokenizer(batch["target"], max_length=64,
+                        truncation=True, padding="max_length")
+     # Replace padding token ids with -100 so the loss ignores them
+     labels["input_ids"] = [
+         [(t if t != tokenizer.pad_token_id else -100) for t in label]
+         for label in labels["input_ids"]
+     ]
+     inputs["labels"] = labels["input_ids"]
+     return inputs

+ # Drop the raw text columns so only model inputs remain
+ tokenized = dataset.map(preprocess, batched=True,
+                         remove_columns=dataset.column_names)

+ args = Seq2SeqTrainingArguments(
+     output_dir="taurabot-finetuned",
+     num_train_epochs=20,  # increase for better results
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     learning_rate=1e-4,  # lower LR when continuing from checkpoint
+     warmup_steps=50,
+     predict_with_generate=True,
+     logging_steps=10,
+     save_strategy="epoch",
+     fp16=True,  # needs a CUDA GPU; if the loss becomes NaN, try bf16=True or full precision
+     push_to_hub=False,  # set True to push to your own HF repo
+ )

+ trainer = Seq2SeqTrainer(
+     model=model,
+     args=args,
+     train_dataset=tokenized,
+     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
+ )
+
+ trainer.train()
+
+ # Save your improved model
+ model.save_pretrained("taurabot-finetuned")
+ tokenizer.save_pretrained("taurabot-finetuned")
+ print("Done! Test your improved model.")
+ ```
198
+
199
+ ### Tips for better fine-tuning results
200
+
201
+ - **More data is the single biggest improvement**: aim for 1,000 to 5,000
202
+ conversation pairs
203
+ - Use **native speaker corrections** if possible
204
+ - Keep conversations **short and natural**: 1 to 2 sentences per turn
205
+ - Always use the `taura:` prefix in your input column
206
+ - A lower learning rate (`1e-4` or `5e-5`) prevents overwriting what the
207
+ model already knows
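
The prefix tip above is easy to get wrong when data comes from several sources. The following is a minimal sketch, not part of the original training pipeline, of one way to normalise the `taura:` prefix and hold out a small validation split before fine-tuning. The helper names `with_prefix` and `build_rows` and the 10% split are illustrative assumptions.

```python
import random

PREFIX = "taura: "  # task prefix stated in this model card


def with_prefix(text: str) -> str:
    """Return text with exactly one task prefix (hypothetical helper)."""
    text = text.strip()
    if text.lower().startswith("taura:"):
        # Remove an existing prefix so it is never doubled
        text = text[len("taura:"):].strip()
    return PREFIX + text


def build_rows(pairs, eval_fraction=0.1, seed=42):
    """Turn (human, bot) pairs into prefixed rows and split off a validation set."""
    rows = [{"input": with_prefix(q), "target": a.strip()} for q, a in pairs]
    random.Random(seed).shuffle(rows)
    # Assumption: keep at least one row for validation
    n_eval = max(1, int(len(rows) * eval_fraction))
    return rows[n_eval:], rows[:n_eval]


pairs = [
    ("Mhoro!", "Mhoro! Makadii?"),
    ("taura: Ndiri kuneta.", "Zorora zvishoma. Unokwanisa!"),
    ("Mhoro, makadii?", "Ndiripo mazvita, imi makadii?"),
]
train_rows, eval_rows = build_rows(pairs)
print(len(train_rows), len(eval_rows))
```

Both lists can be passed to `Dataset.from_list`. Tracking loss on the held-out rows makes it easier to notice overfitting of the kind described in the Limitations section.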
208
+
209
+ ---
210
+
211
+ ## ⚠️ Limitations
212
+
213
+ | Limitation | Detail |
214
+ |---|---|
215
+ | **Compute constraints** | Trained on a single consumer GPU with limited VRAM. Training ran for only 18 of the 200 planned epochs. |
216
+ | **Small training set** | Fine-tuned on 500 conversation pairs, significantly below the recommended minimum for production conversational models |
217
+ | **Early overfitting** | Validation loss stopped improving after epoch 8 (2.33) and then began rising, a sign the model needs more diverse training data |
218
+ | **Hallucinated prefixes** | May occasionally output "Mubvunzo:" or similar artefacts inherited from pre-training data |
219
+ | **Limited domain coverage** | Trained primarily on everyday conversational Shona; will not handle medical, legal, or technical topics |
220
+ | **Dialect coverage** | Covers standard Shona as spoken in Zimbabwe; may not generalise to regional dialects |
221
+ | **Not for high-stakes use** | Should not be used for medical advice, legal decisions, or any critical application without significant further development |
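
The "Hallucinated prefixes" row suggests a simple post-processing step. Here is a hedged sketch of one way to strip a leading label such as "Mubvunzo:" from a generated reply. The function name `clean_reply` and the `ARTEFACT_LABELS` tuple are hypothetical; only "Mubvunzo:" comes from this card, so extend the tuple with whatever artefacts you actually observe.

```python
import re

# "Mubvunzo" is the artefact named in the table above; the rest is up to you.
ARTEFACT_LABELS = ("Mubvunzo",)

# Match one or more leading "<label>:" fragments, ignoring case.
_LABEL_RE = re.compile(
    r"^\s*(?:(?:%s)\s*:\s*)+" % "|".join(re.escape(label) for label in ARTEFACT_LABELS),
    flags=re.IGNORECASE,
)


def clean_reply(text: str) -> str:
    """Remove leading artefact labels from a decoded model reply."""
    return _LABEL_RE.sub("", text).strip()


print(clean_reply("Mubvunzo: Ndiripo mazvita, imi makadii?"))
```

This only removes labels at the start of the reply. Artefacts in the middle of a sentence are left untouched, since removing them blindly could damage legitimate text.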
222
+
223
+ ---
224
+
225
+ ## Training Details
226
+
227
+ ### What the loss curve tells us
228
+
229
+ - Epoch 8: Validation loss 2.33 (best checkpoint)
230
+ - Epoch 9: Validation loss 2.35 (started rising, overfitting)
231
+ - Epoch 18: Validation loss 2.58 (continued rising)
232
+
233
+ The model began overfitting after epoch 8 because 500 conversation pairs
234
+ is a small dataset for a seq2seq model. The best weights are from around
235
+ epoch 8. More diverse training data would push the validation loss lower
236
+ before overfitting begins.
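
To make the reasoning above concrete, the sketch below takes the per-epoch validation losses from the trainer log in the previous version of this README and finds the best epoch, then simulates where a patience-based early-stopping rule would have halted training. The patience value of 3 is an illustrative assumption, not a setting used in the original run.

```python
# Validation loss for epochs 1-18, copied from the original trainer log.
val_loss = [
    7.2146, 4.9347, 3.3783, 2.8899, 2.5140, 2.4059,
    2.3723, 2.3340, 2.3476, 2.3342, 2.3480, 2.3748,
    2.3932, 2.4189, 2.4628, 2.4841, 2.5431, 2.5784,
]


def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


def early_stop_epoch(losses, patience=3):
    """Return the epoch at which training stops after `patience` epochs without improvement."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(losses)


print(best_epoch(val_loss))                    # 8
print(early_stop_epoch(val_loss, patience=3))  # 11
```

With these numbers, a patience of 3 would have stopped at epoch 11 and kept the epoch 8 weights, saving seven epochs of compute.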
237
 
238
  ### Training hyperparameters
239
 
240
+ | Parameter | Value |
241
+ |---|---|
242
+ | Learning Rate | 3e-4 |
243
+ | Train Batch Size | 8 |
244
+ | Gradient Accumulation | 2 (effective batch = 16) |
245
+ | Warmup Ratio | 0.03 |
246
+ | Epochs | 18 (of 200 planned) |
247
+ | Mixed Precision | fp16 |
248
+ | Optimizer | AdamW (fused) |
249
+ | Seed | 42 |
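
The interaction between warmup ratio, planned epochs, and the early stop can be worked out numerically. The sketch below assumes the linear scheduler was sized for the full 200 planned epochs and uses 40 optimizer steps per epoch, as implied by the results table (step 80 at epoch 2). It is a plain-Python approximation of a linear warmup and decay schedule, not the library's own implementation.

```python
PEAK_LR = 3e-4
WARMUP_RATIO = 0.03
STEPS_PER_EPOCH = 40   # from the results table: step 80 at epoch 2
PLANNED_EPOCHS = 200

total_steps = STEPS_PER_EPOCH * PLANNED_EPOCHS      # 8000
warmup_steps = round(total_steps * WARMUP_RATIO)    # 240 (rounding is an assumption)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return PEAK_LR * max(0.0, remaining)


for epoch in (6, 8, 18):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch:>2}, step {step:>3}: lr = {linear_lr(step):.3e}")
```

Under these assumptions, warmup ends at step 240 (epoch 6), and at epoch 18 the learning rate is still about 94% of its peak. A schedule sized for the number of epochs actually needed would decay much faster, which may be worth trying in a retrain.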
 
 
250
 
251
  ### Training results
252
 
253
+ | Epoch | Step | Training Loss | Validation Loss |
254
+ |:-----:|:----:|:-------------:|:---------------:|
255
+ | 2 | 80 | 11.7061 | 4.9347 |
256
+ | 3 | 120 | 6.0485 | 3.3783 |
257
+ | 4 | 160 | 3.8857 | 2.8899 |
258
+ | 5 | 200 | 2.9967 | 2.5140 |
259
+ | 6 | 240 | 2.9967 | 2.4059 |
260
+ | 7 | 280 | 2.5225 | 2.3723 |
261
+ | **8** | **320** | **2.2792** | **2.3340** (best) |
262
+ | 9 | 360 | 2.071 | 2.3476 |
263
+ | 18 | 720 | 1.2119 | 2.5784 |
264
 
265
  ### Framework versions
266
 
267
+ | Library | Version |
268
+ |---|---|
269
+ | Transformers | 4.57.6 |
270
+ | PyTorch | 2.10.0+cu128 |
271
+ | Datasets | 2.21.0 |
272
+ | Tokenizers | 0.22.2 |
273
+
274
+ ---
275
+
276
+ ## Roadmap
277
+
278
+ - [ ] **TauraBot v2**: retrain base model with more steps and larger corpus
279
+ - [ ] **Larger conversation dataset**: expanding beyond 500 pairs
280
+ - [ ] **Shona corpus public release**: `mathiaskabango/shona-corpus`
281
+ - [ ] **Gradio demo space**: interactive TauraBot demo
282
+ - [ ] **Shona Whisper**: speech recognition for Shona
283
+
284
+ ---
285
+
286
+ ## Contact
287
+
288
+ **Developer:** Mathias Kabango
289
+ **Institution:** African Leadership University, Kigali, Rwanda
290
+ **Email:** kabangomathias0@gmail.com
291
+ **GitHub:** [Mathias-Kabango3](https://github.com/Mathias-Kabango3)
292
+ **Base model:** [mathiaskabango/shona-mt5-small](https://huggingface.co/mathiaskabango/shona-mt5-small)
293
+
294
+ If you fine-tune this model and get good results, please open a discussion
295
+ on this repo and share what worked; it will help everyone building Shona
296
+ NLP tools.
297
+
298
+ ---
299
+
300
+ ## Acknowledgements
301
+
302
+ Built as part of a mission to create open-source AI infrastructure for
303
+ African languages. If you are working on Shona, Ndebele, or related Bantu
304
+ languages and want to collaborate, please reach out.
305
+
306
+ ---
307
+
308
+ *Built with ❤️*