---
language:
- kab
- ber
tags:
- emotion-classification
- african-languages
- amazigh
- low-resource
- goemotions
- afro-asiatic
license: apache-2.0
library_name: transformers
base_model: Davlan/afro-xlmr-large
model-index:
- name: kabyle-emotion-afro-xlmr
  results:
  - task:
      type: text-classification
      name: Emotion Classification
    dataset:
      type: silver-labeled
      name: English-Kabyle Parallel Corpus (Tatoeba + Round-trip)
    metrics:
    - type: f1
      value: 0.817
      name: Validation Weighted F1
    - type: accuracy
      value: 0.815
      name: Validation Accuracy
    - type: f1
      value: 0.641
      name: Test Weighted F1
    - type: accuracy
      value: 0.648
      name: Test Accuracy
---

# Kabyle Emotion Classifier (AfroXLMR-Large + GoEmotions)

A fine-tuned **AfroXLMR-Large** model for **28-class emotion recognition in Kabyle** (Taqbaylit), a low-resource Amazigh language of the Afro-Asiatic family spoken in Algeria.

This is the third iteration of the Kabyle emotion model. It upgrades the base model from XLM-RoBERTa-base to AfroXLMR-Large and the label set from 6-class Ekman labels to the 28 fine-grained GoEmotions labels.

---

## Model Details

| Attribute | Value |
|-----------|-------|
| **Base model** | `Davlan/afro-xlmr-large` (AfroXLMR-Large, ~560M params) |
| **Architecture** | XLM-RoBERTa for Sequence Classification |
| **Parameters** | ~560M |
| **Language** | Kabyle (`kab`) |
| **Task** | Text Classification (Emotion Detection) |
| **Classes** | 28 (GoEmotions taxonomy) |
| **Best checkpoint** | Epoch 5 (loaded via `load_best_model_at_end`) |

### 28 Emotion Classes

`admiration`, `amusement`, `anger`, `annoyance`, `approval`, `caring`, `confusion`, `curiosity`, `desire`, `disappointment`, `disapproval`, `disgust`, `embarrassment`, `excitement`, `fear`, `gratitude`, `grief`, `joy`, `love`, `nervousness`, `neutral`, `optimism`, `pride`, `realization`, `relief`, `remorse`, `sadness`, `surprise`

---

## Training Data

The model was trained via **cross-lingual label transfer** from English to Kabyle using parallel sentence pairs:

1. **Round-trip parallel corpus** (`eng_kab_roundtrip_good.tsv`): 131,301 English-Kabyle sentence pairs with back-translation quality scores.
2. **Tatoeba parallel corpus**: 138,353 additional English-Kabyle linked sentences from tatoeba.org.

**Labeling pipeline:**

- English sentences were labeled with `cirimus/modernbert-base-go-emotions` (28-class GoEmotions classifier).
- The single best GoEmotions label and its raw sigmoid confidence were transferred to the Kabyle side via sentence alignment.
- Per-class adaptive thresholds and caps were applied to balance the dataset across all 28 labels.

**Dataset sizes:**

- **Total labeled rows (raw, before balancing):** ~204,000
- **Final training set:** 46,516 rows
- **Validation set:** 6,203 rows
- **Test set:** 9,304 rows

---

## Performance

### Validation Set (Epoch 5)

| Metric | Score |
|--------|-------|
| **F1 (weighted)** | **0.817** |
| **Accuracy** | **0.815** |

### Test Set Results (9,304 samples)

| Emotion | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| admiration | 0.663 | 0.523 | 0.585 | 900 |
| amusement | 0.746 | 0.730 | 0.738 | 137 |
| anger | 0.577 | 0.518 | 0.546 | 326 |
| annoyance | 0.326 | 0.127 | 0.183 | 118 |
| approval | 0.519 | 0.388 | 0.444 | 417 |
| caring | 0.622 | 0.313 | 0.416 | 521 |
| confusion | 0.701 | 0.653 | 0.676 | 288 |
| **curiosity** | **0.938** | **0.977** | **0.957** | 1200 |
| **desire** | **0.880** | **0.885** | **0.882** | 479 |
| disappointment | 0.319 | 0.285 | 0.301 | 130 |
| disapproval | 0.691 | 0.724 | 0.707 | 648 |
| disgust | 0.108 | 0.061 | 0.078 | 66 |
| embarrassment | 0.231 | 0.500 | 0.316 | 42 |
| excitement | 0.201 | 0.243 | 0.220 | 111 |
| fear | 0.738 | 0.684 | 0.710 | 247 |
| **gratitude** | **0.957** | **0.892** | **0.923** | 148 |
| grief | 0.273 | 0.882 | 0.417 | 17 |
| joy | 0.677 | 0.417 | 0.516 | 357 |
| **love** | **0.832** | **0.780** | **0.805** | 513 |
| nervousness | 0.280 | 0.535 | 0.368 | 99 |
| neutral | 0.579 | 0.833 | 0.683 | 1200 |
| optimism | 0.502 | 0.779 | 0.611 | 280 |
| pride | 0.476 | 0.833 | 0.606 | 36 |
| realization | 0.150 | 0.570 | 0.237 | 100 |
| relief | 0.111 | 0.071 | 0.087 | 14 |
| **remorse** | **0.718** | **0.761** | **0.739** | 134 |
| sadness | 0.537 | 0.225 | 0.317 | 547 |
| surprise | 0.802 | 0.672 | 0.732 | 229 |

- **Accuracy:** 0.648
- **Weighted Avg F1:** **0.641**
- **Macro Avg F1:** 0.529

---

## How to Use

### Quick inference with `transformers`

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="boffire/kabyle-emotion-afro-xlmr",
    device=0,  # use -1 for CPU
)

# Example sentences
examples = [
    "Ur d-yelli ara wid akken ttwali",
    "Lliɣ d aɣeznay i uqeddic-agi",
    "Ihi, ma yella, ad nerr",
    "Ahat ad yemmut umdan-nni",
    "Tameddakelt-iw tezwared-iyi",
]

for text in examples:
    # With default arguments the pipeline returns a one-element list
    # holding the highest-scoring label for a single input string.
    top = classifier(text)[0]
    print(f"{text} -> {top['label']} ({top['score']:.3f})")
```

### Loading the model directly

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("boffire/kabyle-emotion-afro-xlmr")
model = AutoModelForSequenceClassification.from_pretrained("boffire/kabyle-emotion-afro-xlmr")

# Tokenize and predict (max_length matches the training setting of 96)
inputs = tokenizer("Tura, Jeǧǧiga tesɛa 20 n yiseggasen.", return_tensors="pt", truncation=True, max_length=96)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and look up the top label
probs = torch.softmax(outputs.logits, dim=-1)[0]
pred_id = int(probs.argmax())
print(model.config.id2label[pred_id], f"{probs[pred_id].item():.3f}")
```

---

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Epochs | 5 (early stopping patience=2) |
| Batch size | 16 per device (effective 64 with gradient accumulation) |
| Gradient accumulation | 4 |
| Learning rate | 2e-5 |
| Max sequence length | 96 |
| Weight decay | 0.01 |
| Warmup steps | ~10% of total steps |
| Optimizer | AdamW |
| Class weights | Balanced (`sklearn.utils.class_weight.compute_class_weight`) |
| Mixed precision | None (float32) |
| Best checkpoint | Epoch 5 |

---

## Limitations & Caveats

1. **Silver labels:** Ground-truth emotions were projected from an English GoEmotions classifier. Some labels may not perfectly capture Kabyle cultural or emotional nuance.
2. **Rare class weakness:** Classes with very few test examples (`relief`: 14, `grief`: 17, `disgust`: 66) have low F1 scores (0.087, 0.417, and 0.078, respectively). The model struggles to learn reliable patterns for these.
3. **Neutral class:** While `neutral` now comes from a real GoEmotions label (not synthetic uncertainty), it still dominates the raw distribution and is capped at 2,000 training examples.
4. **Translation quality:** The parallel corpus includes round-trip translated sentences. Imperfect translations may introduce label noise.
5. **No native speaker validation:** The test set was held out from the same silver-labeled pool, so the reported scores measure agreement with projected labels rather than with native-speaker judgments. A small native-annotated benchmark would give a more reliable estimate.
6. **Domain limitation:** Training data comes from Tatoeba (simple, short sentences) and round-trip translations. Performance may degrade on longer, more complex Kabyle text (social media, literature, etc.).
7. **Kabyle not in AfroXLMR pre-training corpus:** AfroXLMR-Large was adapted to 17 African languages, but Kabyle was not among them. The model therefore relies on transfer from other Afro-Asiatic languages seen during pre-training (e.g., Amharic, Arabic).

---

## Intended Use

- **Research** in low-resource NLP and Afro-Asiatic / Amazigh language processing.
- **Downstream applications** requiring fine-grained emotion signals in Kabyle text (e.g., content moderation, mental-health screening, customer feedback analysis).
- **Baseline** for future Kabyle emotion models trained on native annotations.

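
When the test scores are used as a baseline, it helps to be explicit about how the two reported averages are formed. Macro F1 weights every class equally, so rare, weak classes such as `relief` and `disgust` pull it down. Weighted F1 scales each class by its support, so frequent classes such as `curiosity` and `neutral` dominate. The sketch below recomputes both from the per-class F1 and support values in the test-set table. The names `PER_CLASS`, `macro_f1`, and `weighted_f1` are illustrative and not part of the original training code, and because the table values are rounded to three decimals the results only approximately reproduce the reported 0.529 and 0.641.

```python
# Per-class (F1, support) pairs copied from the test-set table in this card.
# PER_CLASS and both helper functions are illustrative names, not part of
# the original training or evaluation code.
PER_CLASS = {
    "admiration": (0.585, 900),
    "amusement": (0.738, 137),
    "anger": (0.546, 326),
    "annoyance": (0.183, 118),
    "approval": (0.444, 417),
    "caring": (0.416, 521),
    "confusion": (0.676, 288),
    "curiosity": (0.957, 1200),
    "desire": (0.882, 479),
    "disappointment": (0.301, 130),
    "disapproval": (0.707, 648),
    "disgust": (0.078, 66),
    "embarrassment": (0.316, 42),
    "excitement": (0.220, 111),
    "fear": (0.710, 247),
    "gratitude": (0.923, 148),
    "grief": (0.417, 17),
    "joy": (0.516, 357),
    "love": (0.805, 513),
    "nervousness": (0.368, 99),
    "neutral": (0.683, 1200),
    "optimism": (0.611, 280),
    "pride": (0.606, 36),
    "realization": (0.237, 100),
    "relief": (0.087, 14),
    "remorse": (0.739, 134),
    "sadness": (0.317, 547),
    "surprise": (0.732, 229),
}


def macro_f1(per_class):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)


def weighted_f1(per_class):
    """Support-weighted mean of per-class F1: frequent classes dominate."""
    total_support = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total_support


print(f"macro F1:    {macro_f1(PER_CLASS):.3f}")
print(f"weighted F1: {weighted_f1(PER_CLASS):.3f}")
```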
---

## Citation

If you use this model, please cite:

```bibtex
@misc{boffire_kabyle_emotion_afro_xlmr,
  title        = {Kabyle Emotion Classifier (AfroXLMR-Large + GoEmotions)},
  author       = {Boffire},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/boffire/kabyle-emotion-afro-xlmr}},
  note         = {Fine-tuned AfroXLMR-Large for 28-class GoEmotions detection in Kabyle via cross-lingual label transfer from English}
}
```

---

## Acknowledgments

- **Davlan** for the `afro-xlmr-large` base model and African-centric pre-training.
- **cirimus** for the `modernbert-base-go-emotions` English emotion classifier.
- **Google Research** for the GoEmotions dataset.
- **Tatoeba Project** for the English-Kabyle parallel corpus.
- **Hugging Face** `transformers`, `datasets`, and `accelerate` teams for the training infrastructure.

---

## License

This model is released under the **Apache 2.0** license. The base model (`Davlan/afro-xlmr-large`) and English emotion classifier (`cirimus/modernbert-base-go-emotions`) are subject to their respective **MIT** licenses. The GoEmotions dataset is **Apache 2.0**.