Kabyle POS Tagger v2.0

Part-of-speech (POS) tagger for Kabyle (Taqbaylit), an Amazigh language spoken mainly in Algeria and by communities in other parts of the world.

Model Details

Attribute       Value
Architecture    XLM-RoBERTa-base
Task            Token-level POS tagging (17 UD labels)
Training Data   10,000 sentences (8,000 train / 1,000 dev / 1,000 test)
Lexicon Size    214,092 entries from conjugation tables + corpus mining
Raw Corpus      ~209,000 sentences from web crawls and texts
Tokenizer       XLM-RoBERTa + custom Kabyle BPE (boffire/Kabyle-BPE-Tokenizer-v2)
Framework       Transformers 4.x, PyTorch

Performance

Evaluated on the held-out test set (1,000 sentences):

Metric       Score
Accuracy     94.8%
F1 (macro)   93.8%
Precision    93.8%
Recall       93.8%
Loss         0.258

Dev set (1,000 sentences):

Metric       Score
Accuracy     95.1%
F1 (macro)   94.3%
Precision    94.3%
Recall       94.4%
Loss         0.266
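
The evaluation script is not included in this card. As a rough sketch of how scores like these can be computed, the function below takes gold and predicted tag lists and returns accuracy plus macro-averaged precision, recall and F1. Macro averaging for precision and recall is an assumption (the card only labels F1 as macro), and `tagging_metrics` is a hypothetical helper name.

```python
from collections import Counter


def tagging_metrics(gold, pred):
    """Accuracy and macro-averaged precision/recall/F1 for two flat tag lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed

    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy data, not taken from the test set.
gold = ["NOUN", "VERB", "NOUN", "PUNCT"]
pred = ["NOUN", "NOUN", "NOUN", "PUNCT"]
print(tagging_metrics(gold, pred))
```

Macro averaging weights every tag equally, so rare tags such as AUX or SYM can pull the macro scores below the accuracy figure.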

Supported POS Tags (17 UD labels)

Tag     Description                 Examples
ADJ     Adjective                   ameqran, tamezwarut
ADP     Adposition                  deg, ɣer, si, fell-
ADV     Adverb                      tura, azekka, akka, bezzaf
AUX     Auxiliary                   (rare in Kabyle)
CCONJ   Coordinating conjunction    neɣ, dɣa, maca
DET     Determiner                  yal, kull, aṭas, kra
INTJ    Interjection                ttxil-k, acemma
NOUN    Noun                        argaz, tamsalt, aɣerbaz
NUM     Numeral                     yiwen, sin, kraḍ, miya
PART    Particle                    ad, ur, ara, mačči
PRON    Pronoun                     nekk, kečč, netta, acu
PROPN   Proper noun                 Tom, Mary, Didac
PUNCT   Punctuation                 ., ?, !, ,, -
SCONJ   Subordinating conjunction   ma, iwakken, acku, ɣas
SYM     Symbol                      (rare)
VERB    Verb                        yella, yusa, yenna, yebɣa
X       Other                       unknown / unclassified
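
For downstream code it can help to have the tag inventory as a mapping. The sketch below builds `label2id` and `id2label` from the 17 tags listed above. The alphabetical ordering is an assumption made for illustration; the id order actually used by the checkpoint is whatever its config stores (readable as `model.config.id2label` once the model is loaded).

```python
# The 17 Universal Dependencies POS tags from the table above.
# NOTE: alphabetical order is an assumption; check the checkpoint's config for the real ids.
UD_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]

label2id = {tag: i for i, tag in enumerate(UD_TAGS)}
id2label = {i: tag for tag, i in label2id.items()}

print(len(UD_TAGS))       # 17
print(label2id["VERB"])   # 15
```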

Usage

Python

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "boffire/kabyle-pos-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Tag a sentence
sentence = "Aql-iyi d nekk."
results = nlp(sentence)

for token in results:
    print(f"{token['word']:15s} {token['entity_group']}")

Output

Aql-iyi         PRON
d               ADP
e               PRON  (clitic)
nekk            PRON
.               PUNCT
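
The pipeline's aggregation step was designed for entity spans, and it may merge neighbouring tokens that receive the same tag into a single group. If one tag per word is needed, a common alternative is to tag pre-split words and keep the prediction of each word's first subword. The helper below is a minimal sketch of that alignment step; `word_level_tags` is a hypothetical name, the input values are made up so the example runs offline, and it assumes the tokenizer is a fast tokenizer that exposes `word_ids()`.

```python
def word_level_tags(word_ids, pred_ids, id2label):
    """Return one tag per word, taken from the word's first subword.

    word_ids: word index for each subword, None for special tokens.
    pred_ids: predicted label id for each subword (same length).
    """
    tags = []
    previous = None
    for word_id, pred_id in zip(word_ids, pred_ids):
        if word_id is None:        # special tokens such as <s> and </s>
            continue
        if word_id != previous:    # first subword of a new word
            tags.append(id2label[pred_id])
        previous = word_id
    return tags


# With the real model, the inputs would typically be obtained like this:
#   enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
#   word_ids = enc.word_ids()
#   pred_ids = model(**enc).logits.argmax(-1)[0].tolist()

# Made-up values for the words ["Aql-iyi", "d", "nekk", "."]:
id2label = {1: "ADP", 10: "PRON", 12: "PUNCT", 16: "X"}  # hypothetical ids
word_ids = [None, 0, 0, 0, 1, 2, 3, None]
pred_ids = [16, 10, 10, 1, 1, 10, 12, 16]

print(word_level_tags(word_ids, pred_ids, id2label))
# ['PRON', 'ADP', 'PRON', 'PUNCT']
```

Taking the first subword is only one convention; averaging subword scores per word is another, and which one matches the training setup is not stated in this card.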

Command Line

python -c "from transformers import pipeline; nlp = pipeline('token-classification', model='boffire/kabyle-pos-v2', aggregation_strategy='simple'); print(nlp('Aql-iyi d nekk.'))"

Training Details

Data Pipeline

  1. Raw text collection: ~209,000 Kabyle sentences from web sources
  2. Pre-annotation: 214,092-entry seed lexicon built from:
    • 6,198 verb conjugation tables (conjugation.json)
    • Morphology-based noun/adjective/pronoun extraction
    • Clitic bundle heuristics
  3. Stratified sampling: 10,000 sentences sampled, stratified by length (short/medium/long)
  4. Manual review: the most frequent X-tagged tokens corrected iteratively
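
The pipeline code itself is not published in this card. The sketch below illustrates steps 2 and 3 under simple assumptions: pre-annotation is a plain lexicon lookup with X as the fallback, and sentences are sampled proportionally from three length buckets. The function names, the length thresholds (5 and 12 tokens) and the tiny lexicon are all hypothetical.

```python
import random
from collections import defaultdict


def pre_annotate(tokens, lexicon):
    """Tag each token by lexicon lookup; unknown tokens fall back to X."""
    return [lexicon.get(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]


def length_bucket(tokens, short_max=5, medium_max=12):
    """Assign a sentence to a length bucket (thresholds are assumptions)."""
    n = len(tokens)
    if n <= short_max:
        return "short"
    if n <= medium_max:
        return "medium"
    return "long"


def stratified_sample(sentences, k, seed=0):
    """Draw about k sentences, proportionally from each length bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[length_bucket(sent)].append(sent)

    total = len(sentences)
    sample = []
    for name in ("short", "medium", "long"):
        group = buckets[name]
        quota = round(k * len(group) / total) if total else 0
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# Toy lexicon, not the real 214,092-entry resource.
lexicon = {"yella": "VERB", "argaz": "NOUN"}
print(pre_annotate(["Yella", "argaz", "dinna"], lexicon))
# ['VERB', 'NOUN', 'X']
```

Because quotas are rounded per bucket, the returned sample can differ from `k` by a sentence or two; a real pipeline would need a rule for topping up or trimming.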

Training Hyperparameters

Parameter             Value
Base model            xlm-roberta-base
Epochs                10
Batch size            32
Learning rate         5e-5
Weight decay          0.01
Warmup ratio          0.1
Max sequence length   128
Optimizer             AdamW
Scheduler             linear with warmup
Training time         ~41 minutes
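
To make the schedule concrete, the sketch below works out the step counts implied by these values and evaluates a linear warmup and decay schedule at a few steps. It assumes all 8,000 training sentences are used each epoch on a single device with no gradient accumulation, and that warmup steps are the warmup ratio times the total steps, rounded; the card does not confirm these details. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
def linear_warmup_lr(step, base_lr, total_steps, warmup_steps):
    """Learning rate at an optimizer step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_sentences, batch_size, epochs = 8000, 32, 10
steps_per_epoch = -(-train_sentences // batch_size)  # ceiling division -> 250
total_steps = steps_per_epoch * epochs               # 2500
warmup_steps = round(total_steps * 0.1)              # 250

for step in (0, 125, 250, 1375, 2500):
    print(step, linear_warmup_lr(step, 5e-5, total_steps, warmup_steps))
```

Under these assumptions the learning rate peaks at 5e-5 after the first epoch's worth of steps and falls back to zero at the end of epoch 10.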

Loss Curve

  • Initial loss: ~1.8
  • Final loss: 0.258 (train), 0.266 (dev)

Limitations

  1. Clitic bundles: Complex forms like k-id-xeẓẓren or iyi-d-yewwi may still be tagged as X if not in training data
  2. Arabic loans: Words like ddreɛ, lqahwa may be misclassified
  3. Proper nouns: Limited coverage of place names and modern names
  4. Dialectal variation: Model trained on standard Kabyle; regional variants may be tagged less accurately
  5. X-tag rate: ~4.5% of tokens in test set remain unclassified (X)
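
To monitor the X-tag limitation on your own data, one option is to measure the share of tokens tagged X and list the most frequent offenders, which also gives a natural worklist for manual correction. The helper below is a sketch with a hypothetical name (`x_tag_report`) and toy input; it expects tagged sentences as lists of (token, tag) pairs.

```python
from collections import Counter


def x_tag_report(tagged_sentences, top_n=10):
    """Return (share of tokens tagged X, most frequent X-tagged tokens)."""
    total = 0
    x_tokens = Counter()
    for sentence in tagged_sentences:
        for token, tag in sentence:
            total += 1
            if tag == "X":
                x_tokens[token] += 1
    rate = sum(x_tokens.values()) / total if total else 0.0
    return rate, x_tokens.most_common(top_n)


# Toy input, not real model output.
tagged = [
    [("argaz", "NOUN"), ("k-id-xeẓẓren", "X")],
    [("yella", "VERB"), ("k-id-xeẓẓren", "X")],
]
rate, top = x_tag_report(tagged)
print(f"{rate:.1%}", top)
```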

Comparison to Previous Work

Version   Train Size     Test F1   Notes
v1.0      ~1,800 sent.   86.3%     Initial release
v2.0      10,000 sent.   93.8%     +7.5 percentage points over v1.0
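
The gain between versions can be expressed in more than one way, and the figures differ noticeably. The short calculation below uses only the two F1 scores from the table; it assumes both were measured on comparable test sets, which the card does not state.

```python
v1_f1, v2_f1 = 86.3, 93.8

absolute_gain = v2_f1 - v1_f1                          # percentage points
relative_gain = absolute_gain / v1_f1 * 100            # relative to the v1.0 score
error_reduction = absolute_gain / (100 - v1_f1) * 100  # share of v1.0 errors removed

print(f"{absolute_gain:.1f} points, {relative_gain:.1f}% relative, "
      f"{error_reduction:.1f}% error reduction")
# 7.5 points, 8.7% relative, 54.7% error reduction
```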

Citation

If you use this model, please cite:

@misc{kabyle-pos-v2,
  title={Kabyle POS Tagger v2.0: Scaling to 10,000 Sentences},
  author={boffire},
  year={2026},
  howpublished={\url{https://huggingface.co/boffire/kabyle-pos-v2}}
}

Acknowledgments

  • XLM-RoBERTa: Conneau et al. (2020), Facebook AI
  • MasakhaPOS: Dione et al. (2023) for African language POS benchmarks
  • Kabyle conjugation data: Community-contributed verb tables
  • Tatoeba: For Kabyle text resources

License

MIT License — free for research and commercial use.

Contact


Last updated: 2026-06-01
