Kabyle LoRA Adapter for AfriqueQwen3.5-4B

Fine-tuned LoRA adapter for Kabyle (kab) text generation, based on McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs.

In the sample completions under Results, this adapter removes the language confusion seen in the base model (bleeding of Swahili, Somali, and Igbo text) and produces coherent, grammatically correct Kabyle sentences.


Training Details

| Parameter | Value |
|---|---|
| Base model | McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs |
| Training data | 326,147 filtered Kabyle sentences from Tatoeba |
| Raw data source | Tatoeba sentences dump (sentences.tar.bz2) |
| Fine-tuning steps | 1000 |
| Final training loss | 0.665 |
| LoRA rank (r) | 8 |
| LoRA alpha | 16 |
| Target modules | q_proj, v_proj |
| Max sequence length | 64 tokens |
| Per-device batch size | 1 |
| Gradient accumulation | 4 (effective batch = 4) |
| Learning rate | 2e-4 |
| Optimizer | adamw_8bit |
| Quantization | 4-bit (NF4) with double quantization |
| Gradient checkpointing | Enabled |
| Hardware | NVIDIA T4 (Google Colab Free Tier) |
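
As a rough sanity check on how much of the corpus these settings cover, the sketch below multiplies the step count by the effective batch size. It assumes each optimizer step consumes exactly one effective batch and that every training sequence is a single sentence with no packing; neither is stated in this card, and the variable names are illustrative.

```python
# Back-of-the-envelope data coverage from the hyperparameters listed above.
# Assumptions (not stated in this card): one effective batch per optimizer
# step, one sentence per sequence, no sequence packing.
steps = 1000
per_device_batch = 1
grad_accum = 4
corpus_sentences = 326_147

effective_batch = per_device_batch * grad_accum      # sequences per optimizer step
sentences_seen = steps * effective_batch             # sequences consumed in training
epoch_fraction = sentences_seen / corpus_sentences   # share of one full pass

print(f"effective batch: {effective_batch}")
print(f"sentences seen:  {sentences_seen}")
print(f"epoch fraction:  {epoch_fraction:.2%}")
```

Under these assumptions the run sees about 4,000 sentences, a little over 1% of the filtered corpus.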

Dataset Filtering

  • Source: Tatoeba sentences dump - 787,648 raw Kabyle sentences (lang == "kab")
  • Quality filters applied:
    • Sentence length: 3-50 words
    • Minimum 2 Kabyle-specific characters: ɣ ṭ ḍ č ǧ ɛ ḥ ṛ ṣ ẓ
    • Contamination removal: excluded sentences containing Greek/Cyrillic look-alikes:
      • Greek epsilon ε (U+03B5)
      • Cyrillic reversed ze Ԑ (U+0510), ԑ (U+0511), which resembles epsilon
      • Greek gamma γ (U+03B3), Γ (U+0393)
  • Result: 326,147 clean, filtered Kabyle sentences
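
The filtering rules above can be sketched as a small Python predicate. This is an illustration, not the script used to build the dataset: the names `keep_sentence` and `kabyle_sentences` are hypothetical, the "minimum 2" rule is assumed to count occurrences (case-insensitively) rather than distinct letters, and the dump is assumed to be tab-separated `id<TAB>lang<TAB>text` lines.

```python
import unicodedata

# Kabyle-specific letters from the filter list, normalized to composed (NFC) form.
KABYLE_CHARS = set(unicodedata.normalize("NFC", "ɣṭḍčǧɛḥṛṣẓ"))

# Look-alike code points that mark contamination: Greek epsilon and gamma,
# plus the Cyrillic letters at U+0510 / U+0511.
LOOKALIKES = {"\u03b5", "\u0510", "\u0511", "\u03b3", "\u0393"}


def keep_sentence(text, min_words=3, max_words=50, min_kabyle_chars=2):
    """Return True if `text` passes the length, look-alike, and character filters."""
    text = unicodedata.normalize("NFC", text)
    n_words = len(text.split())
    if not (min_words <= n_words <= max_words):
        return False
    if any(ch in LOOKALIKES for ch in text):
        return False
    # Assumption: count every occurrence, ignoring case.
    n_kabyle = sum(1 for ch in text.lower() if ch in KABYLE_CHARS)
    return n_kabyle >= min_kabyle_chars


def kabyle_sentences(tsv_lines):
    """Yield filtered Kabyle sentences from `id<TAB>lang<TAB>text` lines (assumed format)."""
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] == "kab" and keep_sentence(parts[2]):
            yield parts[2]
```

Normalizing to NFC first means letters typed as a base character plus a combining dot or caron are still counted.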

Results

Before vs After Fine-Tuning

| Prompt | Base Model (Before) | Fine-Tuned (Step 1000) |
|---|---|---|
| Ur bɣiɣ ara | ad aɣ-yeǧǧen ad d-nernu iḍelli, ma d am (nonsense) | ad t-ttɣeṣbeḍ. |
| Anwa i | iji ihe eji eme ihe na-eme (Igbo/Yoruba!) | d-yeǧǧan Ḥmed ad yeddu? |
| Lemmer ad | ziwa kuti adzafunikanso (Swahili!) | yuɣal ɣer-sen, ad d-yuɣal deg-sen. |
| Aqcic i | iyo Faysal oo ka hadlay Dastuurka Soomaaliya (Somali!) | iḍelli. |
| Aselkim-iw ur | d-yeɣli yiwen (nonsense) | yelli ara deg wexxam. |
| Ass-a d | ass n 15 di yennayer 2019 (date regurgitation) | dɣa, ur d-ḥewwweṣeɣ ara. |

Loss Progression

| Step | Training Loss |
|---|---|
| 600 | 0.712 |
| 850 | 0.706 |
| 950 | 0.677 |
| 1000 | 0.665 |
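
If the reported loss is the mean per-token cross-entropy in nats (the usual convention for causal language model training, though not stated here), it can be converted to perplexity with exp(loss). A minimal sketch under that assumption:

```python
import math

# Training losses reported above, keyed by step.
losses = {600: 0.712, 850: 0.706, 950: 0.677, 1000: 0.665}

# Assumption: each loss is mean per-token cross-entropy measured in nats,
# so perplexity = exp(loss).
perplexities = {step: math.exp(loss) for step, loss in losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>4}: loss {losses[step]:.3f} -> perplexity {ppl:.3f}")
```

Read this way, training perplexity falls from roughly 2.04 at step 600 to roughly 1.94 at step 1000. These are training-set figures and say nothing about held-out text.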

Key improvements:

  • Zero language confusion - no more Swahili, Somali, or Igbo bleeding
  • Grammatically correct - proper preverbs (ad, d-), clitic pronouns (t-, i), possessives (-iw)
  • Semantically coherent - sentences make sense in context
  • Natural endings - completes with periods and logical conclusions
  • Cultural references - recognizes Kabyle names (Ḥmed) and places (Iɣil Azwaw)

Usage

Requirements

pip install transformers peft accelerate bitsandbytes torch

Load and Generate

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

BASE_MODEL = "McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs"
ADAPTER_MODEL = "boffire/AfriqueQwen3.5-4B-Kabyle-LoRA"

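# 4-bit quantization config (NF4 with double quantization, matching training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # the library default is "fp4", so set NF4 explicitly
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)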

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, ADAPTER_MODEL)

# Generate Kabyle text
prompt = "Taqbaylit d"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
# Example output (sampling is random, so results vary): Taqbaylit d tamaziɣt.

Example Prompts

test_prompts = [
    "Ur bɣiɣ ara",
    "Taqbaylit-iw d",
    "Anwa i",
    "Nekkni n",
    "Aselkim-iw ur",
    "Iḍelli tella",
    "Lemmer ad",
    "Aqcic i",
    "Ur tett ara",
    "Ass-a d",
]
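
To try every prompt in the list, one option is a small helper like the sketch below. `complete_prompts` is a hypothetical name, not part of this repository. It assumes `model` and `tokenizer` have been loaded as in the Load and Generate section and reuses the same sampling settings.

```python
def complete_prompts(model, tokenizer, prompts, max_new_tokens=25):
    """Return {prompt: completion} using the sampling settings from the usage example."""
    completions = {}
    for prompt in prompts:
        # Tokenize one prompt and move the tensors to the model's device.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
        # The decoded text includes the prompt followed by the continuation.
        completions[prompt] = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return completions


# Usage, once the model and tokenizer are loaded:
# for prompt, text in complete_prompts(model, tokenizer, test_prompts).items():
#     print(f"{prompt!r} -> {text!r}")
```

Because sampling is enabled, repeated calls return different completions for the same prompt.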

Repository Contents

| File | Description |
|---|---|
| adapter_config.json | LoRA hyperparameters |
| adapter_model.safetensors | Trained LoRA weights (~2-4 MB) |
| tokenizer_config.json | Tokenizer configuration |
| tokenizer.json | Tokenizer vocabulary |
| special_tokens_map.json | Special token mappings |
| README.md | This file |

Base Model

This adapter is designed to be used with McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs.


Citation

If you use this model, please cite:

AfriqueLLM paper:

@misc{yu2026afriquellmdatamixingmodel,
  title={AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages},
  author={Hao Yu and Tianyi Xu and Michael A. Hedderich and Wassim Hamidouche and Syed Waqas Zamir and David Ifeoluwa Adelani},
  year={2026},
  eprint={2601.06395},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.06395}
}

Tatoeba project:


Limitations

Model Architecture

  • Base model, not chat: This adapter performs text completion only. It does not follow instructions, answer questions, or engage in conversation. For chat capabilities, further supervised fine-tuning (SFT) with instruction-response pairs is required.
  • No chat template: The model does not recognize system/user/assistant roles. Inputs are treated as raw text to be continued.
  • Small LoRA rank (r=8): Only ~917K parameters are trainable (0.02% of total). Complex reasoning, multi-step logic, or rare morphological patterns may be beyond capacity.

Training and Data

  • Short context window: Trained on max 64 tokens. Generation beyond ~64 tokens may degrade in quality or repeat phrases.
  • Domain limited: Training data (Tatoeba) consists primarily of short, general-domain sentences. The model lacks exposure to technical, legal, medical, or academic Kabyle.
  • No dialect awareness: Does not distinguish between Kabyle sub-dialects (At Mengellat, At Weɣlis, Tasaḥlit, etc.). Output may blend dialectal features.
  • Script limited: Trained exclusively on Latin-script Kabyle. Tifinagh and Arabic-script Kabyle are not supported.

Semantic Coherence

  • Grammatically correct but semantically inconsistent: The model reliably generates morphologically valid Kabyle (proper preverbs, clitics, possessives) but may produce sentences that are grammatically well-formed yet semantically odd or factually nonsensical. This is expected behavior for a base causal language model fine-tuned on text completion rather than instruction-following or reasoning tasks.
  • Examples of semantic drift:
    • "Aselkim-iw ur yelli ara deg wexxam" (grammatical but semantically odd: "My computer is not at home")
    • "Taqbaylit-iw d taɛrabt i tt-yeččan" (contradiction: "My Kabyle was eaten by Arabic")
  • Cause: The model predicts the most statistically likely next token based on training patterns. It does not understand meaning, fact-check, or reason about the real world. Lower sampling temperature or greedy decoding (do_sample=False) improves consistency at the cost of creativity.
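
The effect of sampling temperature can be illustrated with a toy next-token distribution. The logits below are made up for illustration and do not come from this model. The sketch only shows that dividing logits by a lower temperature concentrates probability on the top-scoring token, which is why lower temperatures give more consistent output.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens (illustrative values only).
toy_logits = [2.0, 1.0, 0.5]

for t in (1.0, 0.7, 0.3):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature {t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Greedy decoding (do_sample=False) is the limiting case: it always picks the highest-probability token.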

Content and Safety

  • No safety filtering: The model has no built-in guardrails. It may generate toxic, biased, harmful, or culturally inappropriate content if prompted.
  • Hallucination risk: As a base language model, it has no factual grounding. It may invent false information about Kabyle history, people, places, or events.
  • Gender bias: Training data may reflect historical gender stereotypes present in Tatoeba contributions.
  • Temporal cutoff: Factual knowledge is limited to the base model's training cutoff. No awareness of events after approximately 2024.

Multilingual Behavior

  • French interference: Despite fine-tuning, the base model's strong French knowledge may cause French words to appear in Kabyle completions, especially for modern/technical concepts.
  • Code-switching untested: Natural Kabyle-French code-switching (common in Algeria) is not explicitly handled and may produce unpredictable results.
  • No translation capability: The model was not trained on parallel data. English to Kabyle or Kabyle to French translation will be unreliable.

Compute and Deployment

  • 4-bit quantization assumed: The adapter assumes 4-bit (NF4) base model loading. Loading the base model in FP16 requires ~8 GB VRAM for the weights; full precision (FP32) requires ~16 GB.
  • GPU recommended: CPU inference is possible but extremely slow (~10-30x slower than T4 GPU).
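
A weights-only memory estimate is parameter count times bytes per parameter. The sketch below assumes roughly 4 billion parameters (taken from the "4B" in the model name) and ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
# Weights-only memory estimate: parameter count x bytes per parameter.
# Assumptions: ~4e9 parameters (from the "4B" in the model name); activations,
# KV cache, and quantization metadata are ignored, so real usage is higher.
N_PARAMS = 4e9

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "4-bit": 0.5,
}

weights_gb = {name: N_PARAMS * b / 1e9 for name, b in BYTES_PER_PARAM.items()}

for name, gb in weights_gb.items():
    print(f"{name:>5}: ~{gb:.0f} GB for weights")
```

On this estimate only the 4-bit configuration leaves comfortable headroom on a 16 GB T4.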

License

This model is released under the CC BY 4.0 License (same as the base model).


Acknowledgments

  • McGill-NLP for the AfriqueLLM suite
  • Tatoeba contributors for the Kabyle sentence corpus
  • Qwen team for the base architecture
  • Hugging Face for the PEFT and Transformers libraries