How to use boffire/AfriqueQwen3.5-4B-Kabyle-LoRA with PEFT:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs")
model = PeftModel.from_pretrained(base_model, "boffire/AfriqueQwen3.5-4B-Kabyle-LoRA")
```

Fine-tuned LoRA adapter for Kabyle (kab) text generation, based on McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs.
In the sample prompts below, this adapter removes the language confusion seen in the base model (completions drifting into Swahili, Somali, or Igbo) and produces coherent, grammatically correct Kabyle sentences.
| Parameter | Value |
|---|---|
| Base model | McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs |
| Training data | 326,147 filtered Kabyle sentences from Tatoeba |
| Raw data source | Tatoeba sentences dump (sentences.tar.bz2) |
| Fine-tuning steps | 1000 |
| Final training loss | 0.665 |
| LoRA rank (r) | 8 |
| LoRA alpha | 16 |
| Target modules | q_proj, v_proj |
| Max sequence length | 64 |
| Per-device batch size | 1 |
| Gradient accumulation | 4 (effective batch = 4) |
| Learning rate | 2e-4 |
| Optimizer | adamw_8bit |
| Quantization | 4-bit (NF4) with double quantization |
| Gradient checkpointing | Enabled |
| Hardware | NVIDIA T4 (Google Colab Free Tier) |
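The batch settings in the table imply how much of the dataset the run covered. The arithmetic below is a small sketch using only the table's figures. It assumes a single GPU (the table lists one T4) and that every optimizer step consumes one full effective batch of distinct sentences.

```python
# Figures copied from the training table above.
per_device_batch_size = 1
gradient_accumulation_steps = 4
optimizer_steps = 1000
dataset_size = 326_147
num_devices = 1  # assumption: a single T4, as listed under Hardware

# Sequences contributing to each optimizer update.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices

# Total sequences consumed over the whole run.
sequences_seen = effective_batch * optimizer_steps

# Share of the filtered dataset that was visited.
fraction_of_epoch = sequences_seen / dataset_size

print(f"effective batch size: {effective_batch}")
print(f"sequences seen: {sequences_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.2%}")
```

Under these assumptions the run saw about 4,000 sentences, a little over 1% of the filtered data.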
| Prompt | Base Model (Before) | Fine-Tuned (Step 1000) |
|---|---|---|
| Ur bɣiɣ ara | ad aɣ-yeǧǧen ad d-nernu iḍelli, ma d am (nonsense) | ad t-ttɣeṣbeḍ. |
| Anwa i | iji ihe eji eme ihe na-eme (Igbo/Yoruba!) | d-yeǧǧan Ḥmed ad yeddu? |
| Lemmer ad | ziwa kuti adzafunikanso (Swahili!) | yuɣal ɣer-sen, ad d-yuɣal deg-sen. |
| Aqcic i | iyo Faysal oo ka hadlay Dastuurka Soomaaliya (Somali!) | iḍelli. |
| Aselkim-iw ur | d-yeɣli yiwen (nonsense) | yelli ara deg wexxam. |
| Ass-a d | ass n 15 di yennayer 2019 (date regurgitation) | dɣa, ur d-ḥewwweṣeɣ ara. |
| Step | Training Loss |
|---|---|
| 600 | 0.712 |
| 850 | 0.706 |
| 950 | 0.677 |
| 1000 | 0.665 |
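One way to read these loss values is as perplexity. The sketch below assumes the logged loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training in `transformers` but is not stated in this card. The `perplexity` helper is an illustrative name, not part of any library.

```python
import math

# Step -> training loss, copied from the table above.
logged_losses = {600: 0.712, 850: 0.706, 950: 0.677, 1000: 0.665}


def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss_nats)


for step, loss in logged_losses.items():
    print(f"step {step:>4}: loss {loss:.3f} -> perplexity {perplexity(loss):.2f}")

# Change in loss between the first and last logged steps.
first_step, last_step = min(logged_losses), max(logged_losses)
drop = logged_losses[first_step] - logged_losses[last_step]
print(f"loss drop from step {first_step} to {last_step}: {drop:.3f}")
```

These are training-set figures on sequences capped at 64 tokens, so on their own they say nothing about held-out Kabyle text.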
Key improvements:
```bash
pip install transformers peft accelerate bitsandbytes torch
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

BASE_MODEL = "McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs"
ADAPTER_MODEL = "boffire/AfriqueQwen3.5-4B-Kabyle-LoRA"

# 4-bit quantization config (NF4 with double quantization, matching training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, ADAPTER_MODEL)

# Generate Kabyle text
prompt = "Taqbaylit d"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)

completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
# Example output (sampling is stochastic, so yours may differ): Taqbaylit d tamaziɣt.
```
```python
test_prompts = [
    "Ur bɣiɣ ara",
    "Taqbaylit-iw d",
    "Anwa i",
    "Nekkni n",
    "Aselkim-iw ur",
    "Iḍelli tella",
    "Lemmer ad",
    "Aqcic i",
    "Ur tett ara",
    "Ass-a d",
]
```
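
A loop like the following could run such a prompt list through the model. It is a sketch, not the author's evaluation script: `decoding_kwargs` and `complete_prompts` are hypothetical helper names, and the caller is expected to pass in the `model`, `tokenizer`, and `test_prompts` objects defined in this card. The `greedy` flag switches between the sampling settings from the usage example and deterministic decoding (`do_sample=False`).

```python
def decoding_kwargs(greedy: bool = False, max_new_tokens: int = 25) -> dict:
    """Build keyword arguments for `model.generate` (hypothetical helper)."""
    kwargs = {"max_new_tokens": max_new_tokens, "do_sample": not greedy}
    if not greedy:
        # Sampling settings copied from the usage example in this card.
        kwargs.update(temperature=0.7, top_p=0.9)
    return kwargs


def complete_prompts(model, tokenizer, prompts, greedy: bool = False) -> dict:
    """Return a {prompt: completion} mapping (hypothetical helper)."""
    results = {}
    for prompt in prompts:
        # Tokenize and move the tensors to the model's device.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            pad_token_id=tokenizer.eos_token_id,
            **decoding_kwargs(greedy=greedy),
        )
        # The decoded text includes the prompt followed by the continuation.
        results[prompt] = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return results
```

With the model loaded, `complete_prompts(model, tokenizer, test_prompts, greedy=True)` returns one completion per prompt, and repeated runs should give the same text because no sampling is involved.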
| File | Description |
|---|---|
| adapter_config.json | LoRA hyperparameters |
| adapter_model.safetensors | Trained LoRA weights (~2-4 MB) |
| tokenizer_config.json | Tokenizer configuration |
| tokenizer.json | Tokenizer vocabulary |
| special_tokens_map.json | Special token mappings |
| README.md | This file |
This adapter is designed to be used with:
If you use this model, please cite:
AfriqueLLM paper:
```bibtex
@misc{yu2026afriquellmdatamixingmodel,
  title={AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages},
  author={Hao Yu and Tianyi Xu and Michael A. Hedderich and Wassim Hamidouche and Syed Waqas Zamir and David Ifeoluwa Adelani},
  year={2026},
  eprint={2601.06395},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.06395}
}
```
Tatoeba project:
Greedy decoding (`do_sample=False`) improves consistency at the cost of creativity.

This model is released under the CC BY 4.0 License (same as the base model).
Base model
Qwen/Qwen3.5-4B-Base