Kabyle POS Tagger v2.0

A part-of-speech (POS) tagger for Kabyle (Taqbaylit), an Amazigh language spoken mainly in Algeria and by communities in other parts of the world.

Model Details

Attribute	Value
Architecture	XLM-RoBERTa-base
Task	Token-level POS tagging (17 UD labels)
Training Data	10,000 sentences (8,000 train / 1,000 dev / 1,000 test)
Lexicon Size	214,092 entries from conjugation tables + corpus mining
Raw Corpus	~209,000 sentences from web crawls and texts
Tokenizer	XLM-RoBERTa + custom Kabyle BPE (boffire/Kabyle-BPE-Tokenizer-v2)
Framework	Transformers 4.x, PyTorch

Performance

Evaluated on the held-out test set (1,000 sentences):

Metric	Score
Accuracy	94.8%
F1 (macro)	93.8%
Precision	93.8%
Recall	93.8%
Loss	0.258

Dev set (1,000 sentences):

Metric	Score
Accuracy	95.1%
F1 (macro)	94.3%
Precision	94.3%
Recall	94.4%
Loss	0.266
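
The scores above are token-level metrics. As a rough illustration of how accuracy and macro-averaged precision, recall, and F1 can be computed from gold and predicted tag sequences, here is a minimal pure-Python sketch. The function name `macro_scores` and the toy data are hypothetical, and the card does not state which evaluation library was actually used.

```python
from collections import Counter

def macro_scores(gold, pred):
    """Token-level accuracy and macro-averaged precision/recall/F1.

    gold, pred: flat lists of tags of equal, non-zero length.
    Labels are the union of tags seen in gold and pred (an assumption;
    an evaluation script may instead fix the full 17-tag inventory).
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy example: one VERB is mis-tagged as NOUN.
gold = ["NOUN", "VERB", "NOUN", "PUNCT"]
pred = ["NOUN", "NOUN", "NOUN", "PUNCT"]
scores = macro_scores(gold, pred)
```

Macro averaging weights every tag equally, so rare tags such as SYM or AUX influence the score as much as NOUN or VERB.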

Supported POS Tags (17 UD labels)

Tag	Description	Examples
ADJ	Adjective	ameqran, tamezwarut
ADP	Adposition	deg, ɣer, si, fell-
ADV	Adverb	tura, azekka, akka, bezzaf
AUX	Auxiliary	(rare in Kabyle)
CCONJ	Coordinating conjunction	neɣ, dɣa, maca
DET	Determiner	yal, kull, aṭas, kra
INTJ	Interjection	ttxil-k, acemma
NOUN	Noun	argaz, tamsalt, aɣerbaz
NUM	Numeral	yiwen, sin, kraḍ, miya
PART	Particle	ad, ur, ara, mačči
PRON	Pronoun	nekk, kečč, netta, acu
PROPN	Proper noun	Tom, Mary, Didac
PUNCT	Punctuation	., ?, !, ,, -
SCONJ	Subordinating conjunction	ma, iwakken, acku, ɣas
SYM	Symbol	(rare)
VERB	Verb	yella, yusa, yenna, yebɣa
X	Other	unknown / unclassified
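
Token-classification models map each tag to an integer id. The sketch below builds such a mapping for the 17 tags in the table. The alphabetical ordering is an assumption made for illustration; the mapping the released checkpoint actually uses should be read from its configuration (`model.config.id2label`).

```python
# The 17 UD tags from the table above. The alphabetical order is an
# assumption; the released checkpoint defines the real order in its config.
UD_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]

label2id = {tag: i for i, tag in enumerate(UD_TAGS)}
id2label = {i: tag for tag, i in label2id.items()}
```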

Usage

Python

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load model and tokenizer
model_name = "boffire/kabyle-pos-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create pipeline
nlp = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# Tag a sentence
sentence = "Aql-iyi d nekk."
results = nlp(sentence)

for token in results:
    print(f"{token['word']:15s} {token['entity_group']}")

Output

Aql-iyi         PRON
d               ADP
e               PRON  (clitic)
nekk            PRON
.               PUNCT
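
With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. For downstream tools it can be handy to flatten that list into tab-separated rows. The helper `to_tsv` below is a hypothetical sketch, and the sample input is hand-written with made-up scores rather than real model output.

```python
def to_tsv(results):
    """Render aggregated pipeline output as rows of index, word, tag, score."""
    lines = []
    for i, item in enumerate(results, start=1):
        score = float(item["score"])  # scores may arrive as NumPy floats
        lines.append(f"{i}\t{item['word']}\t{item['entity_group']}\t{score:.3f}")
    return "\n".join(lines)

# Hand-written stand-in for pipeline output; the scores are made up.
sample = [
    {"entity_group": "PRON", "score": 0.99, "word": "nekk", "start": 10, "end": 14},
    {"entity_group": "PUNCT", "score": 0.98, "word": ".", "start": 14, "end": 15},
]
print(to_tsv(sample))
```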

Command Line

python -c "from transformers import pipeline; nlp = pipeline('token-classification', model='boffire/kabyle-pos-v2', aggregation_strategy='simple'); print(nlp('Aql-iyi d nekk.'))"

Training Details

Data Pipeline

1. Raw text collection: ~209,000 Kabyle sentences from web sources
2. Pre-annotation: 214,092-entry seed lexicon built from:
   - 6,198 verb conjugation tables (conjugation.json)
   - Morphology-based noun/adjective/pronoun extraction
   - Clitic bundle heuristics
3. Stratified sampling: 10,000 sentences sampled across length strata (short/medium/long)
4. Manual review: the most frequent tokens tagged X were corrected iteratively
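
The pre-annotation and sampling steps above can be sketched roughly as follows. This is an illustrative outline, not the author's pipeline: the toy lexicon, the length thresholds, the function names, and the per-bucket 80/10/10 split are all assumptions.

```python
import random

# Toy seed lexicon; the real one has 214,092 entries.
LEXICON = {"argaz": "NOUN", "yella": "VERB", "deg": "ADP", "nekk": "PRON"}

def pre_annotate(tokens, lexicon):
    """Tag each token by lexicon lookup; unknown tokens fall back to X."""
    return [(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]

def length_bucket(tokens, short_max=5, medium_max=12):
    """Assign a length stratum. The thresholds are hypothetical."""
    n = len(tokens)
    if n <= short_max:
        return "short"
    if n <= medium_max:
        return "medium"
    return "long"

def stratified_split(sentences, ratios=(0.8, 0.1), seed=0):
    """Split tokenized sentences into train/dev/test within each length bucket.

    ratios gives the train and dev shares; the remainder goes to test.
    """
    rng = random.Random(seed)
    buckets = {}
    for sent in sentences:
        buckets.setdefault(length_bucket(sent), []).append(sent)
    train, dev, test = [], [], []
    for name in sorted(buckets):  # fixed order keeps the split reproducible
        group = buckets[name]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_dev = int(len(group) * ratios[1])
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test
```

Splitting inside each bucket keeps the share of short, medium, and long sentences roughly the same across train, dev, and test.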

Training Hyperparameters

Parameter	Value
Base model	xlm-roberta-base
Epochs	10
Batch size	32
Learning rate	5e-5
Weight decay	0.01
Warmup ratio	0.1
Max sequence length	128
Optimizer	AdamW
Scheduler	linear with warmup
Training time	~41 minutes
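
To make the schedule concrete, the sketch below derives the step counts implied by the table and evaluates a linear-warmup, linear-decay learning rate in plain Python. It assumes a single device, no gradient accumulation, and the 8,000-sentence training split; none of these are stated in the card. The function `linear_warmup_lr` is a hypothetical stand-in, not the scheduler object used in training.

```python
import math

def linear_warmup_lr(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Values from the hyperparameter table; 8,000 is the training split size.
train_sentences, batch_size, epochs = 8000, 32, 10
peak_lr, warmup_ratio = 5e-5, 0.1

steps_per_epoch = math.ceil(train_sentences / batch_size)  # 250
total_steps = steps_per_epoch * epochs                     # 2500
warmup_steps = round(total_steps * warmup_ratio)           # 250

for step in (0, 125, 250, 1375, 2500):
    lr = linear_warmup_lr(step, total_steps, warmup_steps, peak_lr)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Under these assumptions the warmup covers exactly the first epoch, and the learning rate peaks at 5e-5 at step 250 before decaying linearly to zero.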

Loss Curve

Initial loss: ~1.8
Final loss: 0.258 (test), 0.266 (dev)

Limitations

Clitic bundles: Complex forms like k-id-xeẓẓren or iyi-d-yewwi may still be tagged as X if they do not occur in the training data
Arabic loans: Loanwords such as ddreɛ and lqahwa may be misclassified
Proper nouns: Limited coverage of place names and modern names
Dialectal variation: Model trained on standard Kabyle; regional variants may differ
X-tag rate: ~4.5% of tokens in test set remain unclassified (X)
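
One way to monitor the X-tag rate on new text, and to find the forms most worth reviewing by hand, is to count X-tagged tokens in the tagger's output. The helper `x_tag_report` below is a hypothetical sketch that works on plain (word, tag) pairs; the example tokens are illustrative, not taken from the test set.

```python
from collections import Counter

def x_tag_report(tagged_tokens, top_n=3):
    """Return the share of tokens tagged X and the most frequent X-tagged forms."""
    total = len(tagged_tokens)
    x_forms = Counter(word.lower() for word, tag in tagged_tokens if tag == "X")
    rate = sum(x_forms.values()) / total if total else 0.0
    return rate, x_forms.most_common(top_n)

# Illustrative input: (word, tag) pairs as produced by any tagging step.
tagged = [
    ("argaz", "NOUN"),
    ("k-id-xeẓẓren", "X"),
    ("yella", "VERB"),
    ("K-id-xeẓẓren", "X"),
]
rate, top_forms = x_tag_report(tagged)
```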

Comparison to Previous Work

Version	Train Size	Test F1	Notes
v1.0	~1,800 sentences	86.3%	Initial release
v2.0	10,000 sentences	93.8%	+7.5 percentage points over v1.0

Citation

If you use this model, please cite:

@misc{kabyle-pos-v2,
  title={Kabyle POS Tagger v2.0: Scaling to 10,000 Sentences},
  author={boffire},
  year={2026},
  howpublished={\url{https://huggingface.co/boffire/kabyle-pos-v2}}
}

Acknowledgments

XLM-RoBERTa: Conneau et al. (2020), Facebook AI
MasakhaPOS: Dione et al. (2023) for African language POS benchmarks
Kabyle conjugation data: Community-contributed verb tables
Tatoeba: For Kabyle text resources

License

MIT License — free for research and commercial use.

Contact

Hugging Face: @boffire
Issues: Discussions tab on the model page (https://huggingface.co/boffire/kabyle-pos-v2)

Last updated: 2026-06-01

Model size: 0.3B params (Safetensors, tensor type F32)
