Kabyle POS Tagger v2.0
Part-of-speech (POS) tagger for Kabyle (Taqbaylit), an Amazigh language spoken mainly in Algeria and by diaspora communities elsewhere.
Model Details
| Attribute | Value |
|---|---|
| Architecture | XLM-RoBERTa-base |
| Task | Token-level POS tagging (17 UD labels) |
| Training Data | 10,000 sentences (8,000 train / 1,000 dev / 1,000 test) |
| Lexicon Size | 214,092 entries from conjugation tables + corpus mining |
| Raw Corpus | ~209,000 sentences from web crawls and texts |
| Tokenizer | XLM-RoBERTa + custom Kabyle BPE (boffire/Kabyle-BPE-Tokenizer-v2) |
| Framework | Transformers 4.x, PyTorch |
Performance
Evaluated on held-out test set (1,000 sentences):
| Metric | Score |
|---|---|
| Accuracy | 94.8% |
| F1 (macro) | 93.8% |
| Precision | 93.8% |
| Recall | 93.8% |
| Loss | 0.258 |
Dev set (1,000 sentences):
| Metric | Score |
|---|---|
| Accuracy | 95.1% |
| F1 (macro) | 94.3% |
| Precision | 94.3% |
| Recall | 94.4% |
| Loss | 0.266 |
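
For readers who want to reproduce numbers of this kind on their own predictions, the following is a minimal sketch of token-level accuracy and macro-averaged precision, recall, and F1. It assumes gold and predicted tags are already aligned one-to-one per token, and it assumes that precision and recall are macro-averaged like F1; the card does not say which evaluation library or averaging was actually used. The function name `token_metrics` is hypothetical.

```python
from collections import Counter


def token_metrics(gold, pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    gold, pred: equal-length lists of tag strings, one entry per token.
    Assumption: every tag seen in gold or pred counts as one class.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag was wrong
            fn[g] += 1  # gold tag was missed

    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        prec = tp[label] / p_den if p_den else 0.0
        rec = tp[label] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Macro averaging weights every tag equally, so rare tags such as AUX or SYM influence the score as much as frequent ones like NOUN. That is one reason macro F1 can sit below accuracy.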
Supported POS Tags (17 UD labels)
| Tag | Description | Examples |
|---|---|---|
| ADJ | Adjective | ameqran, tamezwarut |
| ADP | Adposition | deg, ɣer, si, fell- |
| ADV | Adverb | tura, azekka, akka, bezzaf |
| AUX | Auxiliary | (rare in Kabyle) |
| CCONJ | Coordinating conjunction | neɣ, dɣa, maca |
| DET | Determiner | yal, kull, aṭas, kra |
| INTJ | Interjection | ttxil-k, acemma |
| NOUN | Noun | argaz, tamsalt, aɣerbaz |
| NUM | Numeral | yiwen, sin, kraḍ, miya |
| PART | Particle | ad, ur, ara, mačči |
| PRON | Pronoun | nekk, kečč, netta, acu |
| PROPN | Proper noun | Tom, Mary, Didac |
| PUNCT | Punctuation | ., ?, !, ,, - |
| SCONJ | Subordinating conjunction | ma, iwakken, acku, ɣas |
| SYM | Symbol | (rare) |
| VERB | Verb | yella, yusa, yenna, yebɣa |
| X | Other | unknown / unclassified |
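
When preparing data for a token-classification head, the 17 tags above are typically mapped to integer ids. The sketch below shows one such mapping. The alphabetical index order is an assumption made for illustration; the `id2label` entry in the released model's configuration is the authoritative order. The helper `encode_tags` is hypothetical.

```python
# The 17 UD tags listed in the table above.
# Assumption: alphabetical order; the real model config may order them differently.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]

label2id = {tag: i for i, tag in enumerate(UPOS_TAGS)}
id2label = {i: tag for tag, i in label2id.items()}


def encode_tags(tags):
    """Convert tag strings to integer ids, rejecting tags outside the tagset."""
    unknown = sorted(set(tags) - set(UPOS_TAGS))
    if unknown:
        raise ValueError(f"tags outside the 17-label set: {unknown}")
    return [label2id[tag] for tag in tags]
```

Rejecting unknown tags early catches annotation typos (for example `CONJ` instead of `CCONJ`) before they reach training.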
Usage
Python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "boffire/kabyle-pos-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Tag a sentence
sentence = "Aql-iyi d nekk."
results = nlp(sentence)
for token in results:
    print(f"{token['word']:15s} {token['entity_group']}")
Output
Aql-iyi PRON
d ADP
e PRON (clitic)
nekk PRON
. PUNCT
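
Because XLM-RoBERTa splits words into subword pieces, per-subword predictions have to be collapsed to one tag per word. The sketch below shows the common "first subword" strategy in plain Python. It is an illustration, not necessarily how this model's labels were aligned during training. The `word_ids` list is assumed to have the shape returned by a fast tokenizer's `BatchEncoding.word_ids()` (`None` for special tokens), and the function name and sample values are hypothetical.

```python
def first_subword_labels(word_ids, subword_labels):
    """Collapse per-subword labels to one label per word.

    word_ids: one entry per subword; None marks special tokens such as <s>.
    subword_labels: predicted tag for each subword, same length as word_ids.
    The label of each word's first subword is kept; later pieces are ignored.
    """
    if len(word_ids) != len(subword_labels):
        raise ValueError("word_ids and subword_labels must be equally long")

    word_labels = []
    previous = None
    for word_id, label in zip(word_ids, subword_labels):
        if word_id is None:  # skip special tokens
            previous = None
            continue
        if word_id != previous:  # first piece of a new word
            word_labels.append(label)
        previous = word_id
    return word_labels


# Hypothetical example: word 0 is split into two pieces.
example_ids = [None, 0, 0, 1, 2, 3, None]
example_labels = ["X", "PRON", "NOUN", "PART", "PRON", "PUNCT", "X"]
print(first_subword_labels(example_ids, example_labels))
```

One design note: with `aggregation_strategy="simple"`, the pipeline groups adjacent tokens that share a tag, which can merge two neighbouring words with the same POS into one entry. Aligning by word ids as above keeps exactly one tag per word.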
Command Line
python -c "from transformers import pipeline; print(pipeline('token-classification', model='boffire/kabyle-pos-v2', aggregation_strategy='simple')('Aql-iyi d nekk.'))"
Training Details
Data Pipeline
- Raw text collection: ~209,000 Kabyle sentences from web sources
- Pre-annotation: 214,092-entry seed lexicon built from:
  - 6,198 verb conjugation tables (conjugation.json)
  - Morphology-based noun/adjective/pronoun extraction
  - Clitic bundle heuristics
- Stratified sampling: 10,000 sentences sampled and stratified by length (short/medium/long)
- Manual review: the most frequent X-tagged tokens corrected iteratively
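
The pre-annotation and sampling steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual pipeline: the lexicon is modelled as a plain token-to-tag dictionary, tokens are whitespace-separated, and the length thresholds (`short_max`, `medium_max`) are hypothetical since the card does not give the real cut-offs.

```python
import random
from collections import defaultdict


def pre_annotate(tokens, lexicon):
    """Tag tokens by lexicon lookup; tokens not found fall back to X."""
    return [lexicon.get(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]


def length_bucket(sentence, short_max=5, medium_max=12):
    """Assign a sentence to short/medium/long by whitespace token count.

    The thresholds are assumptions made for illustration.
    """
    n = len(sentence.split())
    if n <= short_max:
        return "short"
    if n <= medium_max:
        return "medium"
    return "long"


def stratified_sample(sentences, k, seed=0):
    """Draw about k sentences, preserving each length bucket's corpus share."""
    if not sentences:
        raise ValueError("sentences must not be empty")
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    buckets = defaultdict(list)
    for sentence in sentences:
        buckets[length_bucket(sentence)].append(sentence)

    sample = []
    for name in ("short", "medium", "long"):
        items = buckets.get(name, [])
        quota = round(k * len(items) / len(sentences))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample
```

Tokens that fall back to X in `pre_annotate` are the natural candidates for the manual review step: counting them by frequency shows which lexicon gaps matter most.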
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | xlm-roberta-base |
| Epochs | 10 |
| Batch size | 32 |
| Learning rate | 5e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Scheduler | linear with warmup |
| Training time | ~41 minutes |
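
To make the schedule concrete, the sketch below derives the step counts implied by the table and evaluates a linear warmup and decay curve. It assumes a single device, no gradient accumulation, and the 8,000-sentence training split; none of these are confirmed by the card. The function `linear_warmup_lr` is a hypothetical helper with the same shape as the usual linear-with-warmup schedule.

```python
import math

TRAIN_SENTENCES = 8_000  # training split size from the Model Details table
BATCH_SIZE = 32
EPOCHS = 10
PEAK_LR = 5e-5
WARMUP_RATIO = 0.1

# Assumption: one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(TRAIN_SENTENCES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def linear_warmup_lr(step):
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to PEAK_LR over warmup_steps, then decays
    linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the warmup phase covers exactly the first epoch, and the learning rate peaks at 5e-5 at the start of the second.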
Loss Curve
- Initial loss: ~1.8
- Final loss: 0.258 (train), 0.266 (dev)
Limitations
- Clitic bundles: Complex forms like k-id-xeẓẓren or iyi-d-yewwi may still be tagged as X if not in training data
- Arabic loans: Words like ddreɛ, lqahwa may be misclassified
- Proper nouns: Limited coverage of place names and modern names
- Dialectal variation: Model trained on standard Kabyle; regional variants may differ
- X-tag rate: ~4.5% of tokens in test set remain unclassified (X)
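
A figure like the X-tag rate can be measured on any tagged output with a few lines of Python. The sketch below assumes the output is available as `(token, tag)` pairs; the function name `x_tag_report` is hypothetical. It also lists the most frequent X-tagged tokens, which is a practical starting point for extending the lexicon.

```python
from collections import Counter


def x_tag_report(tagged_tokens, top_n=10):
    """Return the share of tokens tagged X and the most frequent X tokens.

    tagged_tokens: iterable of (token, tag) pairs.
    """
    pairs = list(tagged_tokens)
    if not pairs:
        raise ValueError("tagged_tokens must not be empty")
    x_tokens = Counter(token for token, tag in pairs if tag == "X")
    rate = sum(x_tokens.values()) / len(pairs)
    return rate, x_tokens.most_common(top_n)
```

For example, calling `x_tag_report(pairs, top_n=20)` on a tagged sample returns the rate as a fraction between 0 and 1 together with up to 20 `(token, count)` entries, most frequent first.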
Comparison to Previous Work
| Version | Train Size | Test F1 | Notes |
|---|---|---|---|
| v1.0 | ~1,800 sent | 86.3% | Initial release |
| v2.0 | 10,000 sent | 93.8% | +7.5 percentage points over v1.0 |
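
The gain between versions can be expressed in more than one way, and the short sketch below works through the arithmetic from the two F1 values in the table. It assumes both scores are comparable, meaning the same metric on comparable test data, which the card does not confirm. The function name `compare_f1` is hypothetical.

```python
def compare_f1(old_f1, new_f1):
    """Compare two F1 scores given in percent.

    Returns the absolute gain in percentage points, the relative gain
    over the old score, and the relative reduction in error (100 - F1).
    """
    absolute_gain = new_f1 - old_f1
    relative_gain = absolute_gain / old_f1
    error_reduction = absolute_gain / (100.0 - old_f1)
    return absolute_gain, relative_gain, error_reduction


points, relative, error_cut = compare_f1(86.3, 93.8)
print(f"{points:.1f} points, {relative:.1%} relative, {error_cut:.1%} fewer errors")
```

Reporting the absolute gain in percentage points avoids confusion with the relative gain, which is a different and larger-looking number.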
Citation
If you use this model, please cite:
@misc{kabyle-pos-v2,
title={Kabyle POS Tagger v2.0: Scaling to 10,000 Sentences},
author={boffire},
year={2026},
howpublished={\url{https://huggingface.co/boffire/kabyle-pos-v2}}
}
Acknowledgments
- XLM-RoBERTa: Conneau et al. (2020), Facebook AI
- MasakhaPOS: Dione et al. (2023) for African language POS benchmarks
- Kabyle conjugation data: Community-contributed verb tables
- Tatoeba: For Kabyle text resources
License
MIT License — free for research and commercial use.
Contact
- Hugging Face: @boffire
- Issues: Discussions
Last updated: 2026-06-01