| license: apache-2.0 | |
| language: | |
| - ar | |
| pipeline_tag: fill-mask | |
| library_name: transformers | |
| tags: | |
| - modernbert | |
| - arabic | |
| - fill-mask | |
| - long-context | |
| # AraModernBert-base-V1.0 | |
| ## Model Description | |
| AraModernBERT is an Arabic encoder-only language model built on the ModernBERT architecture. It combines ModernBERT's transformer design with masked language model training on 100 GB of Arabic text. | |
| AraModernBERT was developed through the following process: | |
| 1. **Custom Tokenizer Training:** We built a specialized tokenizer optimized for Arabic language processing with a vocabulary size of 50,280 tokens. | |
| 2. **Transtokenization:** We used the transtokenization technique to initialize the embedding layer for MLM (for details, see the [paper](https://arxiv.org/abs/2408.04303)). | |
| 3. **Large-Scale Masked Language Modeling:** The model was trained on 100 GB of Arabic text. | |
| ## Model Configuration | |
| ```json | |
| { | |
| "hidden_size": 768, | |
| "intermediate_size": 1152, | |
| "num_attention_heads": 12, | |
| "num_hidden_layers": 22, | |
| "max_position_embeddings": 8192, | |
| "vocab_size": 50280, | |
| "global_attn_every_n_layers": 3, | |
| "local_attention": 128, | |
| "global_rope_theta": 160000.0, | |
| "local_rope_theta": 10000.0, | |
| "architectures": ["ModernBertForMaskedLM"], | |
| "model_type": "modernbert", | |
| "cls_token_id": 3, | |
| "mask_token_id": 6, | |
| "pad_token_id": 5, | |
| "sep_token_id": 4, | |
| "unk_token_id": 2 | |
| } | |
| ``` | |
| ## Intended Uses & Limitations | |
| AraModernBERT can be used for a wide range of Arabic NLP tasks, including: | |
| - **Text Embeddings & Representation** | |
| - **Information Retrieval** | |
| - **RAG (Retrieval Augmented Generation)** | |
| - **Document Similarity** | |
| - **Text Classification** | |
| - **Sentiment Analysis** | |
| ### Limitations and Biases | |
| - The model is optimized for Modern Standard Arabic and may show varying performance on dialectal Arabic variants or classical Arabic texts. | |
| - Performance may vary across domains and specialized terminology. | |
| - Users should be aware of potential biases present in the training data. | |
| ### Evaluation Results | |
|  | |
| #### 1. Semantic Textual Similarity (STS) | |
| We fine-tuned the model on STS datasets to enhance semantic understanding capabilities: | |
| - **STS17:** 0.831 | |
| - **STS22:** 0.617 | |
| *Note: The STS-optimized model will be released soon as a separate checkpoint.* | |
| #### 2. Text Classification | |
| We fine-tuned AraModernBERT on a multi-class classification task using the [SANAD](https://huggingface.co/datasets/arbml/SANAD) dataset. | |
| **Overall Metrics:** | |
| - **AraModernBERT:** | |
|   - Accuracy: 94.32% | |
|   - F1 Score: 94.31% | |
|   - Precision: 94.31% | |
|   - Recall: 94.32% | |
| **Per-Class Performance (AraModernBERT):** | |
| | Class | Precision | Recall | F1-Score | Support | | |
| |-------|-----------|--------|----------|---------| | |
| | 0 | 92.13% | 92.43% | 92.28% | 1,849 | | |
| | 1 | 93.63% | 93.70% | 93.67% | 3,937 | | |
| | 2 | 90.70% | 90.70% | 90.70% | 2,075 | | |
| | 3 | 96.30% | 93.81% | 95.04% | 776 | | |
| | 4 | 96.09% | 95.84% | 95.96% | 1,898 | | |
| | 5 | 89.24% | 87.99% | 88.61% | 641 | | |
| | 6 | 98.55% | 99.37% | 98.96% | 3,005 | | |
| #### 3. Named Entity Recognition (NER) | |
| On Arabic NER, the fine-tuned model achieved the following scores: | |
| - **Accuracy:** 90.39% | |
| - **Precision:** 73.57% | |
| - **Recall:** 74.42% | |
| - **F1:** 73.99% | |
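As a quick consistency check on the NER numbers, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision (0.7357) and recall (0.7442); the helper name `f1_score` is ours for illustration and is not part of the model's evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported NER precision and recall, expressed as fractions
ner_f1 = f1_score(0.7357, 0.7442)
print(f"{ner_f1:.4f}")
```

The result rounds to the reported F1 of 0.7399.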
| ## How to Use | |
| Here's how to use AraModernBERT with the Transformers library: | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModel | |
| # Load model and tokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0") | |
| model = AutoModel.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0") | |
| model.eval()  # disable dropout for inference | |
| # Encode text | |
| text = "مرحبا بكم في عالم الذكاء الاصطناعي" | |
| inputs = tokenizer(text, return_tensors="pt") | |
| # Forward pass without gradient tracking | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
| # Token-level embeddings with shape (batch_size, sequence_length, hidden_size) | |
| embeddings = outputs.last_hidden_state | |
| ``` | |
| ### Masked Language Modeling Example | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModelForMaskedLM | |
| tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0") | |
| model = AutoModelForMaskedLM.from_pretrained("NAMAA-Space/AraModernBert-Base-V1.0") | |
| model.eval()  # disable dropout for inference | |
| # Use the tokenizer's own mask token rather than hard-coding it | |
| text = f"الذكاء الاصطناعي هو {tokenizer.mask_token} المستقبل." | |
| inputs = tokenizer(text, return_tensors="pt") | |
| # Position of the mask token in the input sequence | |
| mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0] | |
| with torch.no_grad(): | |
|     logits = model(**inputs).logits | |
| # Most likely vocabulary id at the masked position | |
| predicted_token_id = logits[0, mask_index].argmax(dim=-1) | |
| predicted_token = tokenizer.decode(predicted_token_id) | |
| print(predicted_token) | |
| ``` | |
| ## Model Architecture | |
| AraModernBERT inherits the architectural features of ModernBERT and adds the Trans-Tokenization approach for embedding initialization: | |
| - **22 transformer layers** with 768 hidden dimensions | |
| - **Alternating Attention mechanism** with global attention every 3 layers and a local attention window of 128 tokens | |
| - **Rotary Positional Embeddings (RoPE)** with different theta values for global (160000.0) and local (10000.0) attention | |
| - **8,192 token context window** for processing longer documents | |
| - **Specialized vocabulary** of 50,280 tokens optimized for Arabic | |
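To make the alternating attention pattern concrete, the sketch below lists which of the 22 layers would use global attention and which would use the 128-token local window. It assumes the convention that layer `i` is global when `i % global_attn_every_n_layers == 0` (our reading of the ModernBERT design, not something this card states), and the helper `layer_plan` is a hypothetical name used only for illustration.

```python
# Values copied from the Model Configuration section
NUM_HIDDEN_LAYERS = 22
GLOBAL_ATTN_EVERY_N_LAYERS = 3
LOCAL_ATTENTION = 128
GLOBAL_ROPE_THETA = 160000.0
LOCAL_ROPE_THETA = 10000.0


def layer_plan(num_layers: int, every_n: int) -> list[tuple[int, str]]:
    """Return (layer_index, attention_kind) pairs.

    Assumption: a layer is global when its index is divisible by every_n.
    """
    return [
        (i, "global" if i % every_n == 0 else "local")
        for i in range(num_layers)
    ]


plan = layer_plan(NUM_HIDDEN_LAYERS, GLOBAL_ATTN_EVERY_N_LAYERS)
global_layers = [i for i, kind in plan if kind == "global"]
local_layers = [i for i, kind in plan if kind == "local"]

print(f"global layers (RoPE theta {GLOBAL_ROPE_THETA}): {global_layers}")
print(f"local layers (window {LOCAL_ATTENTION}, RoPE theta {LOCAL_ROPE_THETA}): {local_layers}")
```

Under this assumption, 8 layers attend over the full sequence and the remaining 14 attend only within the local window, which is what keeps long-context inference cheaper than full attention in every layer.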
| ## Technical Specifications | |
| - **Base Architecture:** ModernBERT | |
| - **Parameters:** ~149M (based on configuration) | |
| - **Context Length:** 8,192 tokens | |
| - **Vocabulary Size:** 50,280 | |
| - **Hidden Size:** 768 | |
| - **Attention Heads:** 12 | |
| - **Hidden Layers:** 22 | |
| - **Intermediate Size:** 1152 | |
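The ~149M figure can be roughly reproduced from the configuration alone. The sketch below is a back-of-the-envelope estimate, assuming the ModernBERT layer layout as we understand it: a fused QKV projection plus an output projection in attention, a gated (GeGLU-style) feed-forward block whose input projection is twice the intermediate size, no bias terms, and output embeddings tied to the input embeddings. Normalization weights and the masked-language-modeling head are ignored, and `estimate_parameters` is a hypothetical helper, not a library function.

```python
def estimate_parameters(
    hidden_size: int,
    intermediate_size: int,
    num_hidden_layers: int,
    vocab_size: int,
) -> int:
    """Approximate weight count, ignoring norms, biases, and the MLM head."""
    # Token embedding matrix (assumed to be shared with the output decoder)
    embeddings = vocab_size * hidden_size
    # Attention: fused Q, K, V projection plus the output projection
    attention = hidden_size * 3 * hidden_size + hidden_size * hidden_size
    # Gated feed-forward: input projection to 2 * intermediate, then back to hidden
    feed_forward = hidden_size * 2 * intermediate_size + intermediate_size * hidden_size
    return embeddings + num_hidden_layers * (attention + feed_forward)


total = estimate_parameters(
    hidden_size=768,
    intermediate_size=1152,
    num_hidden_layers=22,
    vocab_size=50280,
)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

This gives about 148.9M weights; the small terms left out of the estimate account for the remaining gap to the stated ~149M.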
| ## Citation | |
| If you use this model in your research, please cite: | |
| ``` | |
| @inproceedings{elshehy-etal-2026-aramodernbert, | |
| title = "{A}ra{M}odern{BERT}: Transtokenized Initialization and Long-Context Encoder Modeling for {A}rabic", | |
| author = "Elshehy, Omar and | |
| Nacar, Omer and | |
| Djamai, Abdelbasset and | |
| Ragab, Muhammed and | |
| AL Jallad, Khloud and | |
| Abdelazim, Mona", | |
| editor = "El-Haj, Mo and | |
| Rayson, Paul and | |
| Jarrar, Mustafa and | |
| Ezeani, Ignatius and | |
| Ezzini, Saad and | |
| Ahmadi, Sina and | |
| Haddad Haddad, Amal and | |
| Amol, Cynthia and | |
| Abdelali, Ahmad and | |
| Abudalfa, Shadi", | |
| booktitle = "Proceedings of the 2nd Workshop on {NLP} for Languages Using {A}rabic Script", | |
| month = mar, | |
| year = "2026", | |
| address = "Rabat, Morocco", | |
| publisher = "Association for Computational Linguistics", | |
| url = "https://aclanthology.org/2026.abjadnlp-1.39/", | |
| doi = "10.18653/v1/2026.abjadnlp-1.39", | |
| pages = "313--321", | |
| } | |
| ``` | |
| ## Acknowledgements | |
| This model builds upon the ModernBERT architecture developed by Answer.AI and LightOn. We acknowledge their contributions to the field of encoder-only models and extend their work to the Arabic language through our novel Trans-Tokenized approach. | |
| ``` | |
| @misc{modernbert, | |
| title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, | |
| author={Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli}, | |
| year={2024}, | |
| eprint={2412.13663}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL}, | |
| url={https://arxiv.org/abs/2412.13663}, | |
| } | |
| ``` | |
| ``` | |
| @inproceedings{remy-delobelle2024transtokenization, | |
| title={Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of {LLM}s for Low-Resource {NLP}}, | |
| author={Remy, Fran{\c{c}}ois and Delobelle, Pieter and Avetisyan, Hayastan and Khabibullina, Alfiya and de Lhoneux, Miryam and Demeester, Thomas}, | |
| booktitle={First Conference on Language Modeling}, | |
| year={2024}, | |
| url={https://openreview.net/forum?id=sBxvoDhvao} | |
| } | |
| ``` | |