---
license: mit
language:
- ar
metrics:
- f1
base_model:
- CAMeL-Lab/readability-arabertv2-d3tok-reg
tags:
- Arabic
---
# MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment
This repository contains the official models and results for **MorphoArabia**, the submission to the **[BAREC 2025 Shared Task](https://sites.google.com/view/barec-2025/home)** on Arabic Readability Assessment.
#### By: [Fatimah Mohamed Emad Elden](https://scholar.google.com/citations?user=CfX6eA8AAAAJ&hl=ar)
#### *Cairo University*
[Paper (arXiv)](https://arxiv.org/abs/25XX.XXXXX)
[Code (GitHub)](https://github.com/astral-fate/barec-Arabic-Readability-Assessment)
[Models (Hugging Face collection)](https://huggingface.co/collections/FatimahEmadEldin/barec-shared-task-2025-689195853f581b9a60f9bd6c)
[License](https://github.com/astral-fate/mentalqa2025/blob/main/LICENSE)
---
## Model Description
This project introduces a **morphologically-aware approach** for assessing the readability of Arabic text. The system is built around a fine-tuned regression model designed to process morphologically analyzed text. For the **Constrained** and **Open** tracks of the shared task, this core model is extended into a hybrid architecture that incorporates seven engineered lexical features.
A key element of this system is its deep morphological preprocessing pipeline. For the base models, the input is segmented with the `d3tok` morphological tokenization from **CAMeL Tools**, which captures linguistic complexity that surface-level tokenization often misses. This approach achieved a peak **Quadratic Weighted Kappa (QWK) score of 84.2** (on a 0-100 scale) on the Strict-track sentence-level test set.
The model predicts a readability score on a **19-level scale**, from 1 (easiest) to 19 (hardest), for a given Arabic sentence or document.
-----
## 🚀 How to Use the Hybrid Model
This repository contains a fine-tuned hybrid model that combines a transformer's text understanding with explicit lexical features for a more robust readability assessment.
**NOTE:** This is a custom model architecture. You **must** use the `trust_remote_code=True` argument when loading it.
### Step 1: Installation
First, install all the necessary libraries. You will need `arabert` for the specific preprocessing steps this model requires.
```bash
pip install transformers torch pandas arabert
```
### Step 2: Preprocessing and Feature Engineering
To use the model correctly, you must replicate the same preprocessing and feature engineering steps used during training. The input text is first cleaned using the `AraBERT` preprocessor. Then, 7 lexical features are extracted based on the **SAMER Readability Lexicon**.
The 7 features are:
* **Character Count**: The total number of characters in the preprocessed text.
* **Word Count**: The total number of words in the text.
* **Average Word Length**: The average number of characters per word.
* **Average Word Difficulty**: The mean readability score of all words, based on the SAMER lexicon (defaulting to 3.0 for unknown words).
* **Maximum Word Difficulty**: The highest readability score of any single word in the text.
* **Difficult Word Count**: The number of words with a readability score greater than 4.
* **OOV Ratio**: The ratio of words in the text that are not found in the SAMER lexicon.
### Step 3: Full Inference Example
The following code provides a complete, runnable example for getting a prediction from a single sentence. It includes the necessary preprocessing functions.
```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel
from arabert.preprocess import ArabertPreprocessor
# --- 1. Define the Feature Engineering Function ---
# This function must be defined to process your text.
def get_lexical_features(text, lexicon):
    """Calculates the 7 lexical features based on the SAMER lexicon."""
    if not lexicon or not isinstance(text, str):
        return [0.0] * 7
    words = text.split()
    if not words:
        return [0.0] * 7
    # Default difficulty for words not in the lexicon is 3.0
    word_difficulties = [lexicon.get(word, 3.0) for word in words]
    features = [
        float(len(text)),                                   # character count
        float(len(words)),                                  # word count
        float(np.mean([len(w) for w in words])),            # average word length
        float(np.mean(word_difficulties)),                  # average word difficulty
        float(np.max(word_difficulties)),                   # maximum word difficulty
        float(np.sum(np.array(word_difficulties) > 4)),     # difficult word count
        float(len([w for w in words if w not in lexicon]) / len(words)),  # OOV ratio
    ]
    return features
# --- 2. Initialize Models and Processors ---
repo_id = "FatimahEmadEldin/Constrained-Track-Sentence-Bassline-Readability-Arabertv2-d3tok-reg"
arabert_preprocessor = ArabertPreprocessor(model_name="aubmindlab/bert-large-arabertv2")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# Load the model with trust_remote_code=True to use the custom HybridRegressionModel
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
# --- 3. Prepare Input Text and Lexicon ---
# NOTE: For this example, we use a small, sample lexicon. For best results,
# you should load the full 'SAMER-Readability-Lexicon-v2.tsv' file.
sample_lexicon = {'جملة': 2.5, 'عربية': 3.1, 'بسيطة': 1.8}
text = "هذا مثال لجملة عربية بسيطة."
# --- 4. Run the Full Pipeline ---
# a. Preprocess the text
preprocessed_text = arabert_preprocessor.preprocess(text)
# b. Extract the 7 lexical features
numerical_features_list = get_lexical_features(preprocessed_text, sample_lexicon)
numerical_features = torch.tensor([numerical_features_list], dtype=torch.float)
# c. Tokenize the text
inputs = tokenizer(preprocessed_text, return_tensors="pt", padding=True, truncation=True)
# d. Add numerical features to the model's inputs
inputs['features'] = numerical_features
# --- 5. Perform Inference ---
model.eval() # Set the model to evaluation mode
with torch.no_grad():
    logits = model(**inputs)
# --- 6. Process the Output ---
predicted_score = logits.item()
# Clip the score to the valid 0-18 range, then shift to the 1-19 final level
final_level = round(max(0, min(18, predicted_score))) + 1
print(f"Input Text: '{text}'")
print(f"Preprocessed Text: '{preprocessed_text}'")
print(f"Extracted Features: {numerical_features_list}")
print("-" * 30)
print(f"Raw Regression Score: {predicted_score:.4f}")
print(f"Predicted Readability Level (1-19): {final_level}")
```
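The example above uses a three-word sample lexicon. The following is a minimal sketch of how a full lexicon could be read from a tab-separated file into the `dict` that `get_lexical_features` expects. The helper name `load_lexicon` and the column names `lemma` and `readability` are assumptions for illustration only; check the header row of the actual SAMER file and pass the matching names.

```python
import csv
import io

def load_lexicon(tsv_file, word_col="lemma", score_col="readability"):
    """Read a TSV stream into a {word: difficulty} dict.

    word_col and score_col are hypothetical column names; adjust them to
    match the header row of the lexicon file you actually have.
    """
    lexicon = {}
    # QUOTE_NONE keeps quote characters inside entries from being interpreted.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        word = (row.get(word_col) or "").strip()
        raw_score = (row.get(score_col) or "").strip()
        if not word or not raw_score:
            continue  # skip rows with a missing word or score
        try:
            lexicon[word] = float(raw_score)
        except ValueError:
            continue  # skip rows whose score is not numeric
    return lexicon

# Demo with an in-memory TSV so the snippet runs without any external file.
demo_tsv = "lemma\treadability\nجملة\t2.5\nعربية\t3.1\nبسيطة\t1.8\nناقص\t\n"
demo_lexicon = load_lexicon(io.StringIO(demo_tsv))
print(demo_lexicon)
```

With a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object to `load_lexicon`. Because `get_lexical_features` looks words up by exact match, the dictionary keys need to match the whitespace-separated tokens of the preprocessed text.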
## ⚙️ Training Procedure
The system employs two distinct architectures based on the track's constraints:
* **Strict Track**: This track uses a base regression model, `CAMeL-Lab/readability-arabertv2-d3tok-reg`, fine-tuned directly on the BAREC dataset.
* **Constrained and Open Tracks**: These tracks utilize a hybrid model. This architecture combines the deep contextual understanding of the Transformer with explicit numerical features. The final representation for a sentence is created by concatenating the Transformer's `[CLS]` token embedding with a 7-dimensional vector of engineered lexical features derived from the SAMER lexicon.
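The concatenation described for the hybrid model can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the repository's actual `HybridRegressionModel`: the class names `TinyEncoder` and `HybridRegressionSketch`, the hidden size of 32, and the single linear regression head are all hypothetical, and a small embedding layer stands in for the Transformer so the snippet runs without downloading weights.

```python
import torch
import torch.nn as nn

NUM_LEXICAL_FEATURES = 7  # the seven engineered lexical features

class TinyEncoder(nn.Module):
    """Stand-in for the Transformer encoder (hypothetical, for shape checks only)."""
    def __init__(self, vocab_size=100, hidden_size=32):
        super().__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        # Returns (batch, seq_len, hidden_size), shaped like a last hidden state.
        return self.embed(input_ids)

class HybridRegressionSketch(nn.Module):
    """Concatenates the first-token embedding with the lexical feature vector."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        # Assumed head: one linear layer over [CLS] embedding + 7 features.
        self.regressor = nn.Linear(hidden_size + NUM_LEXICAL_FEATURES, 1)

    def forward(self, input_ids, features):
        hidden_states = self.encoder(input_ids)
        cls_embedding = hidden_states[:, 0, :]             # position 0 holds [CLS]
        combined = torch.cat([cls_embedding, features], dim=1)
        return self.regressor(combined).squeeze(-1)        # one score per sentence

torch.manual_seed(0)
encoder = TinyEncoder()
sketch_model = HybridRegressionSketch(encoder, hidden_size=encoder.hidden_size)
input_ids = torch.randint(0, 100, (2, 5))                  # 2 sentences, 5 token ids each
features = torch.rand(2, NUM_LEXICAL_FEATURES)             # 7 features per sentence
scores = sketch_model(input_ids, features)
print(scores.shape)
```

Appending the features at the representation level lets the regression head weigh lexicon-based signals alongside the contextual embedding, without changing the encoder itself.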
### Data and Hyperparameters
The model was trained on a combined dataset of **97,874 training records** and validated against **7,310 validation records**. The following key hyperparameters were used during training:
* **Epochs**: 8
* **Learning Rate**: 3e-5
* **Evaluation Batch Size**: 64
* **Warmup Ratio**: 0.1
* **Weight Decay**: 0.01
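
To make the warmup ratio concrete, the hyperparameters above can be collected in a plain dictionary and translated into a step count. The key names mirror Hugging Face `TrainingArguments` parameters, but this card does not state which trainer was used. The training batch size of 16, single-device training without gradient accumulation, and rounding up are assumptions made purely for the arithmetic.

```python
import math

# Values taken from the list above; key names follow TrainingArguments conventions.
hyperparams = {
    "num_train_epochs": 8,
    "learning_rate": 3e-5,
    "per_device_eval_batch_size": 64,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
}

NUM_TRAIN_RECORDS = 97_874
ASSUMED_TRAIN_BATCH_SIZE = 16  # assumption: not stated in this model card

def warmup_steps(num_records, batch_size, epochs, warmup_ratio):
    """Return (total optimizer steps, warmup steps) for one device, no accumulation."""
    steps_per_epoch = math.ceil(num_records / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, math.ceil(total_steps * warmup_ratio)

total, warmup = warmup_steps(
    NUM_TRAIN_RECORDS,
    ASSUMED_TRAIN_BATCH_SIZE,
    hyperparams["num_train_epochs"],
    hyperparams["warmup_ratio"],
)
print(f"total optimizer steps: {total}, warmup steps: {warmup}")
```

A different training batch size changes both numbers proportionally, so treat the result as an order-of-magnitude guide.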
-----
## 📊 Evaluation Results
The models were evaluated on the blind test set provided by the BAREC organizers. The primary metric for evaluation is the **Quadratic Weighted Kappa (QWK)**, which penalizes larger disagreements more severely.
### Final Test Set Scores (QWK)
| Track | Task | Dev (QWK × 100) | Test (QWK × 100) |
| :--- | :--- | :---: | :---: |
| **Strict** | Sentence | 82.3 | **84.2** |
| | Document | 82.3\* | 79.9 |
| **Constrained** | Sentence | 81.0 | 82.9 |
| | Document | 83.5\* | 75.5 |
| **Open** | Sentence | 82.7 | 83.6 |
| | Document | 82.7\* | **79.2** |
*\*Document-level dev scores are based on the performance of the sentence-level model on the validation set.*
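
For reference, the QWK metric can be sketched in pure Python as below. The function name `quadratic_weighted_kappa` and its arguments are illustrative, not the organizers' scoring script. It uses the standard definition: one minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights proportional to the squared distance between levels. The result lies on a -1 to 1 scale, so multiply by 100 to compare with scores such as 84.2.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, num_levels=19):
    """QWK for integer levels in 1..num_levels (result is on a -1..1 scale)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if any(not 1 <= v <= num_levels for v in list(y_true) + list(y_pred)):
        raise ValueError("labels must lie in 1..num_levels")

    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))   # confusion counts
    true_hist = Counter(y_true)               # marginal counts of gold levels
    pred_hist = Counter(y_pred)               # marginal counts of predicted levels
    max_sq_dist = (num_levels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(1, num_levels + 1):
        for j in range(1, num_levels + 1):
            weight = (i - j) ** 2 / max_sq_dist   # grows quadratically with distance
            weighted_observed += weight * observed.get((i, j), 0)
            weighted_expected += weight * true_hist.get(i, 0) * pred_hist.get(j, 0) / n

    if weighted_expected == 0:
        # Only happens when every gold and predicted label is the same single level.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected

gold = [1, 2, 3, 4]
near_miss = quadratic_weighted_kappa(gold, [1, 2, 3, 5])    # off by one level
far_miss = quadratic_weighted_kappa(gold, [1, 2, 3, 19])    # off by fifteen levels
print(f"near miss: {near_miss:.3f}, far miss: {far_miss:.3f}")
```

The two demo calls differ only in how far the single wrong prediction is from the gold level, which shows why a distant error costs more than an adjacent one. In practice, scikit-learn's `cohen_kappa_score(y_true, y_pred, weights="quadratic")` computes the same metric.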
-----
## 📜 Citation
If you use this work, please cite the paper:
```
@inproceedings{eldin2025morphoarabia,
title={{MorphoArabia at BAREC 2025 Shared Task: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment}},
author={Eldin, Fatimah Mohamed Emad},
year={2025},
booktitle={Proceedings of the BAREC 2025 Shared Task},
eprint={25XX.XXXXX},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```