---
language:
- ur
license: mit
library_name: transformers
pipeline_tag: fill-mask
tags:
- urdu
- roberta
- masked-language-modeling
- encoder
- dunbaabert
datasets:
- uonlp/CulturaX
---

# DunbaaBERT

DunbaaBERT is a family of Urdu RoBERTa-base encoder models trained from scratch on a deduplicated 17 GB Urdu corpus. The models use Byte-BPE vocabularies of 32k, 52k, and 96k tokens and are released under the MIT license.
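
Since the card declares `transformers` as the library and `fill-mask` as the pipeline tag, a minimal usage sketch could look like the following. The repository id `your-namespace/DunbaaBERT-32k` is a hypothetical placeholder (this card does not state the published id), and the helper names are illustrative only.

```python
# Hypothetical repository id: replace with the actual Hub id of the checkpoint.
MODEL_ID = "your-namespace/DunbaaBERT-32k"


def format_predictions(predictions):
    """Reduce fill-mask pipeline output to (token, score) pairs."""
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]


def fill_mask(text_template, model_id=MODEL_ID, top_k=5):
    """Fill the single `{mask}` placeholder in `text_template`.

    The import is kept inside the function so the helpers above can be
    used without loading transformers.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = text_template.format(mask=fill.tokenizer.mask_token)
    return format_predictions(fill(text, top_k=top_k))


# Example call (downloads the model, so it is left commented out):
# fill_mask("پاکستان کا دارالحکومت {mask} ہے۔")
```

Reading the mask token from the tokenizer avoids assuming a particular mask string for the checkpoint.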

## Model Details

- **Model type:** RoBERTa-style masked language model
- **Language:** Urdu
- **Architecture:** Encoder-only Transformer
- **Training objective:** Masked Language Modeling with Whole Word Masking (WWM)
- **Sequence length:** 512 tokens
- **Training corpus:** 17 GB deduplicated Urdu text
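
To illustrate the Whole Word Masking objective listed above, here is a simplified sketch. It assumes GPT-2 style byte-level BPE tokens where a leading `Ġ` marks the start of a word, a 15% word-level masking rate, and a `<mask>` token; none of these details are stated in this card, and the usual random or unchanged token replacement used in RoBERTa-style masking is omitted. It is not the training code used for DunbaaBERT.

```python
import random

MASK = "<mask>"  # assumed mask token (RoBERTa convention)


def group_into_words(tokens, word_start="Ġ"):
    """Group token indices into words.

    A token that starts with `word_start` opens a new word; any other
    token is treated as a continuation of the previous word.
    """
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith(word_start) or not words:
            words.append([i])
        else:
            words[-1].append(i)
    return words


def whole_word_mask(tokens, mask_prob=0.15, seed=None):
    """Mask whole words: every subword of a selected word is masked."""
    words = group_into_words(tokens)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None means "not predicted"
    if not words:
        return masked, labels
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(words) * mask_prob))
    for word in rng.sample(words, n_to_mask):
        for i in word:
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels
```

The key property is that a word split into several subword tokens is never partially masked, so the model cannot recover a masked piece from its visible neighbours within the same word.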

## Model Variants

| Model | Vocabulary Size | Parameters |
|---------|---------:|---------:|
| **DunbaaBERT-32k** | 32,009 | 110,625,024 |
| DunbaaBERT-52k | 52,009 | 125,985,024 |
| DunbaaBERT-96k | 96,009 | 159,777,024 |
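
The parameter counts differ only through the token embedding matrix. As a rough check, the sketch below counts parameters for an encoder with standard RoBERTa-base dimensions (12 layers, hidden size 768, feed-forward size 3072, 514 position embeddings, one token type, plus a pooler). These dimensions are assumptions, since the card does not list them, but under them the counts match the table exactly.

```python
def roberta_encoder_params(vocab_size, hidden=768, layers=12, ffn=3072,
                           max_positions=514, type_vocab=1, with_pooler=True):
    """Count parameters of a RoBERTa-style encoder (no LM head)."""
    # Token, position and token-type embeddings, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two feed-forward projections, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer (weight and bias each).
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * per_layer + pooler


for name, vocab in [("32k", 32_009), ("52k", 52_009), ("96k", 96_009)]:
    print(f"DunbaaBERT-{name}: {roberta_encoder_params(vocab):,}")
# DunbaaBERT-32k: 110,625,024
# DunbaaBERT-52k: 125,985,024
# DunbaaBERT-96k: 159,777,024
```

Each additional vocabulary entry adds 768 parameters under these assumptions, so going from 32,009 to 52,009 tokens adds 20,000 × 768 = 15,360,000 parameters.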

## Training Data

The final corpus was constructed from multiple Urdu resources and deduplicated at line level.

| Corpus | Size |
|----------|---------:|
| mC4 | 17.0 GB |
| OSCAR-2019 | 869 MB |
| OSCAR-2109 | 604 MB |
| OSCAR-2201 | 344 MB |
| OSCAR-2301 | 982 MB |
| Urdu Wikipedia | 364 MB |
| Filtered NLLB Urdu | 2.1 GB |
| **Total before deduplication** | **22.3 GB** |
| **Final corpus** | **17.0 GB** |
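
According to the table, deduplication reduced the corpus from 22.3 GB to 17.0 GB, a reduction of roughly 24%. A minimal sketch of line-level deduplication is shown below. The exact normalization and tooling used for DunbaaBERT are not described in this card, so the whitespace stripping and SHA-1 hashing here are assumptions.

```python
import hashlib


def dedup_lines(lines):
    """Yield each line the first time it is seen, preserving order.

    Lines are compared after stripping surrounding whitespace. Storing a
    fixed-size hash instead of the full line keeps memory use bounded per
    unique line.
    """
    seen = set()
    for line in lines:
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line


# Example on a tiny in-memory corpus; for real files, pass an open file handle.
sample = ["یہ ایک جملہ ہے۔\n", "دوسرا جملہ۔\n", "یہ ایک جملہ ہے۔\n"]
unique = list(dedup_lines(sample))
```

Because the function is a generator, it can stream over a large file without loading all lines into memory at once; only the hashes of unique lines are retained.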

## Pre-training

- **Vocabulary size:** 32k
- **Training steps:** 100k
- **Hardware:** 2x H100 GPUs
- **Batch size:** 8k

## Evaluation Results

### Main Results

| Model | UrBLiMP | COUNT19 F1 | USADC F1 | PSL-Kabaddi F1 | IMDB Urdu F1 | Avg. Norm. Eff. |
|---------|---------:|---------:|---------:|---------:|---------:|---------:|
| DunbaaBERT-32k | 95.1 | 94.44 | **94.08** | 70.08 | 90.13 | **0.859** |
| DunbaaBERT-52k | 97.0 | 94.91 | 91.75 | 67.60 | 90.14 | 0.795 |
| DunbaaBERT-96k | 94.6 | 95.22 | 89.97 | 70.53 | 90.65 | 0.813 |
| Urdu-RoBERTa-small | 90.5 | 92.08 | 85.36 | 67.06 | 84.72 | 0.781 |
| HPLT-BERT-ur | **97.3** | **95.71** | 93.51 | **71.11** | 89.69 | 0.597 |
| mBERT | 75.5 | 90.88 | 83.03 | 65.78 | 85.47 | 0.744 |
| mmBERT-small | 89.5 | 92.36 | 73.09 | 70.36 | 85.44 | 0.494 |
| mmBERT-base | 92.4 | 93.97 | 77.77 | 67.75 | 87.31 | 0.495 |
| XLM-R-base | 89.6 | 93.72 | 85.22 | 60.56 | 88.69 | 0.754 |
| XLM-R-large | 94.1 | 94.38 | 83.55 | 69.62 | **91.15** | 0.492 |

### Efficiency

We report a normalized efficiency metric that combines Macro-F1 and inference throughput.
On the averaged score in the table above, all three DunbaaBERT models (0.795 to 0.859) rank ahead of every baseline, the best of which is Urdu-RoBERTa-small at 0.781.

Within the DunbaaBERT family, DunbaaBERT-52k achieved the strongest linguistic probing performance on UrBLiMP (97.0, behind only HPLT-BERT-ur at 97.3 overall), while DunbaaBERT-32k had the highest average normalized efficiency (0.859).
DunbaaBERT-96k ranked second in average efficiency (0.813) despite having the largest vocabulary.

## Fairseq Checkpoint

The fairseq checkpoint is available [here](https://drive.proton.me/urls/CBZ3JEK138#ACWZfQGdVev0).

## Citation

```bibtex
@misc{maab2026dunbaabertsacrificesemantics,
      title={DunbaaBERT: From Sacrifice to Semantics}, 
      author={Iffat Maab and Waleed Jamil and Raphael Schmitt},
      year={2026},
      eprint={2605.26935},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.26935}, 
}
```

## License

MIT License