Kabyle LoRA Adapter for AfriqueQwen3.5-4B

Fine-tuned LoRA adapter for Kabyle (kab) text generation, based on McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs.

In the test prompts shown below, this adapter removes the language confusion seen in the base model (completions drifting into Swahili, Somali, or Igbo) and produces coherent, grammatically correct Kabyle sentences.

Training Details

Parameter	Value
Base model	McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs
Training data	326,147 filtered Kabyle sentences from Tatoeba
Raw data source	Tatoeba sentences dump (sentences.tar.bz2)
Fine-tuning steps	1000
Final training loss	0.665
LoRA rank (r)	8
LoRA alpha	16
Target modules	q_proj, v_proj
Max sequence length	64
Per-device batch size	1
Gradient accumulation	4 (effective batch = 4)
Learning rate	2e-4
Optimizer	adamw_8bit
Quantization	4-bit (NF4) with double quantization
Gradient checkpointing	Enabled
Hardware	NVIDIA T4 (Google Colab Free Tier)
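
As a rough sanity check on how much of the corpus the run covered, the values in the table can be combined directly. This is only a sketch: it assumes a single GPU, one effective batch per optimizer step, and no packing of several sentences into one sequence, none of which is stated above.

```python
# Values copied from the training table above.
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 4
STEPS = 1000
DATASET_SIZE = 326_147
NUM_DEVICES = 1  # assumption: a single T4, as on Colab Free Tier

# Sequences consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# Total sequences seen over the whole run (assumes no sequence packing).
sequences_seen = effective_batch * STEPS
# Share of the filtered corpus covered by the run.
epoch_fraction = sequences_seen / DATASET_SIZE

print(f"effective batch: {effective_batch}")
print(f"sequences seen:  {sequences_seen}")
print(f"epoch fraction:  {epoch_fraction:.2%}")
```

Under these assumptions the run saw about 4,000 sentences, a little over 1% of the filtered corpus.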

Dataset Filtering

Source: Tatoeba sentences dump - 787,648 raw Kabyle sentences (lang == "kab")
Quality filters applied:
- Sentence length: 3-50 words
- Minimum 2 Kabyle-specific characters: ɣ ṭ ḍ č ǧ ɛ ḥ ṛ ṣ ẓ
- Contamination removal: excluded sentences containing Greek/Cyrillic look-alikes:
  - Greek epsilon ε (U+03B5)
  - Cyrillic reversed ze Ԑ (U+0510), ԑ (U+0511), which resembles epsilon
  - Greek gamma γ (U+03B3), Γ (U+0393)
Result: 326,147 clean, filtered Kabyle sentences
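
The filters above can be sketched as a single predicate. This is not the original filtering script: the function name `keep_sentence` is hypothetical, "words" are assumed to be whitespace-separated tokens, and "minimum 2 Kabyle-specific characters" is read as two occurrences (case-insensitive) rather than two distinct characters.

```python
import unicodedata

# Kabyle-specific letters from the list above, normalized to NFC.
KABYLE_CHARS = set(unicodedata.normalize("NFC", "ɣṭḍčǧɛḥṛṣẓ"))
# Greek/Cyrillic look-alikes from the list above.
LOOKALIKES = set("\u03b5\u0510\u0511\u03b3\u0393")


def keep_sentence(text: str) -> bool:
    """Return True if a sentence passes all three quality filters."""
    # Normalize so decomposed letters (base + combining mark) still match.
    text = unicodedata.normalize("NFC", text)
    # Length filter: 3-50 whitespace-separated words (assumed tokenization).
    if not 3 <= len(text.split()) <= 50:
        return False
    # Contamination filter: reject any Greek/Cyrillic look-alike.
    if any(ch in LOOKALIKES for ch in text):
        return False
    # Require at least two occurrences of Kabyle-specific letters.
    kabyle_count = sum(ch in KABYLE_CHARS for ch in text.lower())
    return kabyle_count >= 2


print(keep_sentence("Ur bɣiɣ ara"))            # Latin gamma, 3 words
print(keep_sentence("Ur b\u03b3i\u03b3 ara"))  # Greek gamma look-alike
```

Applied to the dump, a predicate like this would be run over every row where lang == "kab".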

Results

Before vs After Fine-Tuning

Prompt	Base Model (Before)	Fine-Tuned (Step 1000)
Ur bɣiɣ ara	ad aɣ-yeǧǧen ad d-nernu iḍelli, ma d am (nonsense)	ad t-ttɣeṣbeḍ.
Anwa i	iji ihe eji eme ihe na-eme (Igbo/Yoruba!)	d-yeǧǧan Ḥmed ad yeddu?
Lemmer ad	ziwa kuti adzafunikanso (Swahili!)	yuɣal ɣer-sen, ad d-yuɣal deg-sen.
Aqcic i	iyo Faysal oo ka hadlay Dastuurka Soomaaliya (Somali!)	iḍelli.
Aselkim-iw ur	d-yeɣli yiwen (nonsense)	yelli ara deg wexxam.
Ass-a d	ass n 15 di yennayer 2019 (date regurgitation)	dɣa, ur d-ḥewwweṣeɣ ara.

Loss Progression

Step	Training Loss
600	0.712
850	0.706
950	0.677
1000	0.665
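
If the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training, but an assumption here), it converts to perplexity as exp(loss). The sketch below applies that conversion to the logged values.

```python
import math

# Step -> training loss, copied from the table above.
losses = {600: 0.712, 850: 0.706, 950: 0.677, 1000: 0.665}

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexities = {step: math.exp(loss) for step, loss in losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>4}: loss {losses[step]:.3f} -> perplexity {ppl:.3f}")
```

These are training-set figures over sequences of at most 64 tokens, so they say nothing about held-out performance.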

Key improvements:

- No language confusion in the test prompts: no more Swahili, Somali, or Igbo bleeding
- Grammatically correct: proper preverbs (ad, d-), clitic pronouns (t-, i), possessives (-iw)
- Semantically coherent: sentences make sense in context
- Natural endings: completions close with a period and a logical conclusion
- Cultural references: recognizes Kabyle names (Ḥmed) and places (Iɣil Azwaw)

Usage

Requirements

pip install transformers peft accelerate bitsandbytes torch

Load and Generate

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

BASE_MODEL = "McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs"
ADAPTER_MODEL = "boffire/AfriqueQwen3.5-4B-Kabyle-LoRA"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # match the NF4 quantization used in training (default is "fp4")
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, ADAPTER_MODEL)

# Generate Kabyle text
prompt = "Taqbaylit d"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
# Example output (sampling is enabled, so results vary between runs):
# Taqbaylit d tamaziɣt.

Example Prompts

test_prompts = [
    "Ur bɣiɣ ara",
    "Taqbaylit-iw d",
    "Anwa i",
    "Nekkni n",
    "Aselkim-iw ur",
    "Iḍelli tella",
    "Lemmer ad",
    "Aqcic i",
    "Ur tett ara",
    "Ass-a d",
]
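
One way to run a prompt list like this is a small helper that takes any completion function. This is a sketch only: `run_prompts` and the `complete` callback are hypothetical names, and in practice `complete` would wrap the tokenizer, `model.generate`, and `tokenizer.decode` calls from the example above.

```python
from typing import Callable, Dict, List


def run_prompts(prompts: List[str], complete: Callable[[str], str]) -> Dict[str, str]:
    """Map each prompt to the continuation produced by `complete`."""
    results = {}
    for prompt in prompts:
        text = complete(prompt)
        # Decoded output usually repeats the prompt; keep only the continuation.
        continuation = text[len(prompt):] if text.startswith(prompt) else text
        results[prompt] = continuation.strip()
    return results


# Stand-in completion function so the sketch runs without a GPU.
# With the real model, `complete` would tokenize the prompt, call
# model.generate(...), and decode outputs[0].
demo = run_prompts(["Anwa i", "Ass-a d"], lambda p: p + " ...")
for prompt, continuation in demo.items():
    print(f"{prompt!r} -> {continuation!r}")
```

Keeping the generation call behind a callback makes it easy to compare the base model and the adapter on the same prompts.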

Repository Contents

File	Description
adapter_config.json	LoRA hyperparameters
adapter_model.safetensors	Trained LoRA weights (~2-4 MB)
tokenizer_config.json	Tokenizer configuration
tokenizer.json	Tokenizer vocabulary
special_tokens_map.json	Special token mappings
README.md	This file

Base Model

This adapter is designed to be used with:

Base: McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs
Type: Causal Language Model (Base/Pre-trained)
Parameters: 4B
Context Length: 32,768 tokens
African Languages: 50 (including Kabyle)

Citation

If you use this model, please cite:

AfriqueLLM paper:

@misc{yu2026afriquellmdatamixingmodel,
  title={AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages},
  author={Hao Yu and Tianyi Xu and Michael A. Hedderich and Wassim Hamidouche and Syed Waqas Zamir and David Ifeoluwa Adelani},
  year={2026},
  eprint={2601.06395},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.06395}
}

Tatoeba project:

https://tatoeba.org

Limitations

Model Architecture

Base model, not chat: This adapter performs text completion only. It does not follow instructions, answer questions, or engage in conversation. For chat capabilities, further supervised fine-tuning (SFT) with instruction-response pairs is required.
No chat template: The model does not recognize system/user/assistant roles. Inputs are treated as raw text to be continued.
Small LoRA rank (r=8): Only ~917K parameters are trainable (0.02% of total). Complex reasoning, multi-step logic, or rare morphological patterns may be beyond capacity.
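
For intuition about where a trainable-parameter count like this comes from: LoRA adds two low-rank matrices to each adapted linear layer, A of shape (r, d_in) and B of shape (d_out, r), so each layer contributes r * (d_in + d_out) trainable parameters. The layer shapes in the sketch below are hypothetical placeholders, not the dimensions of the base model, which this card does not list.

```python
from typing import Iterable, Tuple


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (r x d_in) + B (d_out x r)."""
    return r * (d_in + d_out)


def total_lora_params(layer_shapes: Iterable[Tuple[int, int]], r: int) -> int:
    """Sum LoRA parameters over all adapted layers, given (d_in, d_out) per layer."""
    return sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes)


# Hypothetical toy model: 4 blocks, each with a q_proj (1024 -> 1024)
# and a v_proj (1024 -> 256). These are NOT the base model's real shapes.
toy_shapes = [(1024, 1024), (1024, 256)] * 4
print(total_lora_params(toy_shapes, r=8))
```

Because the count grows linearly in r, doubling the rank doubles the adapter size for the same target modules.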

Training and Data

Short context window: Trained on max 64 tokens. Generation beyond ~64 tokens may degrade in quality or repeat phrases.
Domain limited: Training data (Tatoeba) consists primarily of short, general-domain sentences. The model lacks exposure to technical, legal, medical, or academic Kabyle.
No dialect awareness: Does not distinguish between Kabyle sub-dialects (At Mengellat, At Weɣlis, Tasaḥlit, etc.). Output may blend dialectal features.
Script limited: Trained exclusively on Latin-script Kabyle. Tifinagh and Arabic-script Kabyle are not supported.
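
Since only Latin-script Kabyle is supported, it can help to flag Tifinagh or Arabic-script input before generation. The helper below is a hypothetical sketch that relies on Unicode character names from the standard library; it is not part of the adapter.

```python
import unicodedata
from typing import Set


def unsupported_scripts(text: str) -> Set[str]:
    """Return the unsupported scripts (Tifinagh, Arabic) found in `text`."""
    found = set()
    for ch in text:
        # Unicode names start with the script, e.g. "TIFINAGH LETTER YA".
        name = unicodedata.name(ch, "")
        if name.startswith("TIFINAGH"):
            found.add("Tifinagh")
        elif name.startswith("ARABIC"):
            found.add("Arabic")
    return found


print(unsupported_scripts("Taqbaylit d tamaziɣt"))  # Latin script only
print(unsupported_scripts("\u2d5c\u2d30"))          # two Tifinagh letters
```

A caller could reject or transliterate the prompt whenever the returned set is non-empty.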

Semantic Coherence

Grammatically correct but semantically inconsistent: The model reliably generates morphologically valid Kabyle (proper preverbs, clitics, possessives) but may produce sentences that are grammatically well-formed yet semantically odd or factually nonsensical. This is expected behavior for a base causal language model fine-tuned on text completion rather than instruction-following or reasoning tasks.
Examples of semantic drift:
- "Aselkim-iw ur yelli ara deg wexxam" (grammatical but semantically odd: "My computer is not at home")
- "Taqbaylit-iw d taɛrabt i tt-yeččan" (contradiction: "My Kabyle was eaten by Arabic")
Cause: The model predicts the most statistically likely next token based on training patterns. It does not understand meaning, fact-check, or reason about the real world. Lower sampling temperature or greedy decoding (do_sample=False) improves consistency at the cost of creativity.
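
The two decoding modes mentioned above can be expressed as keyword arguments for `model.generate`. The helper name `generation_kwargs` is hypothetical; the sampling defaults mirror the usage example in this card.

```python
def generation_kwargs(
    deterministic: bool,
    pad_token_id: int,
    max_new_tokens: int = 25,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> dict:
    """Build keyword arguments for model.generate in greedy or sampling mode."""
    kwargs = {"max_new_tokens": max_new_tokens, "pad_token_id": pad_token_id}
    if deterministic:
        # Greedy decoding: always pick the most likely next token.
        kwargs["do_sample"] = False
    else:
        # Nucleus sampling: more varied, less repeatable output.
        kwargs.update(do_sample=True, temperature=temperature, top_p=top_p)
    return kwargs


# The pad_token_id here is a placeholder; with the real tokenizer it would be
# tokenizer.eos_token_id, as in the usage example.
print(generation_kwargs(deterministic=True, pad_token_id=0))
print(generation_kwargs(deterministic=False, pad_token_id=0, temperature=0.3))
```

With the loaded model this would be used as `model.generate(**inputs, **generation_kwargs(True, tokenizer.eos_token_id))`.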

Content and Safety

No safety filtering: The model has no built-in guardrails. It may generate toxic, biased, harmful, or culturally inappropriate content if prompted.
Hallucination risk: As a base language model, it has no factual grounding. It may invent false information about Kabyle history, people, places, or events.
Gender bias: Training data may reflect historical gender stereotypes present in Tatoeba contributions.
Temporal cutoff: Factual knowledge is limited to the base model's training cutoff. No awareness of events after approximately 2024.

Multilingual Behavior

French interference: Despite fine-tuning, the base model's strong French knowledge may cause French words to appear in Kabyle completions, especially for modern/technical concepts.
Code-switching untested: Natural Kabyle-French code-switching (common in Algeria) is not explicitly handled and may produce unpredictable results.
No translation capability: The model was not trained on parallel data. English to Kabyle or Kabyle to French translation will be unreliable.

Compute and Deployment

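4-bit quantization assumed: The adapter was trained against a 4-bit (NF4) base model, and the usage example also loads the base model in 4-bit. Loading the 4B-parameter base weights unquantized needs roughly 8 GB of VRAM in FP16 or roughly 16 GB in full precision (FP32), before activations and the KV cache.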
GPU recommended: CPU inference is possible but extremely slow (~10-30x slower than T4 GPU).
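
A back-of-the-envelope estimate of weight memory follows from parameter count times bytes per parameter. The sketch below assumes exactly 4 billion parameters and counts weights only; activations, the KV cache, and quantization constants add to these figures.

```python
# Approximate storage per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 4e9  # assumption: "4B" taken as exactly 4 billion

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

By this estimate the 4-bit weights fit comfortably on a 16 GB T4, while FP16 weights leave much less room for activations.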

License

This model is released under the CC BY 4.0 License (same as the base model).

Acknowledgments

McGill-NLP for the AfriqueLLM suite
Tatoeba contributors for the Kabyle sentence corpus
Qwen team for the base architecture
Hugging Face for the PEFT and Transformers libraries
