---
language:
- ur
license: mit
library_name: transformers
pipeline_tag: fill-mask
tags:
- urdu
- roberta
- masked-language-modeling
- encoder
- dunbaabert
datasets:
- uonlp/CulturaX
---

# DunbaaBERT

DunbaaBERT is a family of Urdu RoBERTa-base encoder models trained from scratch on a deduplicated 17 GB Urdu corpus. The models use Byte-BPE vocabularies of 32k, 52k, and 96k tokens and are released under the MIT license.

## Model Details

- **Model type:** RoBERTa-style masked language model
- **Language:** Urdu
- **Architecture:** Encoder-only Transformer
- **Training objective:** Masked Language Modeling with Whole Word Masking (WWM)
- **Sequence length:** 512 tokens
- **Training corpus:** 17 GB deduplicated Urdu text

## Model Variants

| Model | Vocabulary Size | Parameters |
|---------|---------:|---------:|
| **DunbaaBERT-32k** | 32,009 | 110,625,024 |
| DunbaaBERT-52k | 52,009 | 125,985,024 |
| DunbaaBERT-96k | 96,009 | 159,777,024 |

## Training Data

The final corpus was assembled from multiple Urdu resources and deduplicated at the line level.

| Corpus | Size |
|----------|---------:|
| mC4 | 17.0 GB |
| OSCAR-2019 | 869 MB |
| OSCAR-2109 | 604 MB |
| OSCAR-2201 | 344 MB |
| OSCAR-2301 | 982 MB |
| Urdu Wikipedia | 364 MB |
| Filtered NLLB Urdu | 2.1 GB |
| **Total before deduplication** | **22.3 GB** |
| **Final corpus (after deduplication)** | **17.0 GB** |

## Pre-training

- Vocabulary size: 32k
- Training steps: 100k
- Batch size: 8k
- Hardware: 2x H100 GPUs

## Evaluation Results

### Main Results

Bold marks the best score in each column.

| Model | UrBLiMP | COUNT19 F1 | USADC F1 | PSL-Kabaddi F1 | IMDB Urdu F1 | Avg. Normalized Efficiency |
|---------|---------:|---------:|---------:|---------:|---------:|---------:|
| DunbaaBERT-32k | 95.1 | 94.44 | **94.08** | 70.08 | 90.13 | **0.859** |
| DunbaaBERT-52k | 97.0 | 94.91 | 91.75 | 67.60 | 90.14 | 0.795 |
| DunbaaBERT-96k | 94.6 | 95.22 | 89.97 | 70.53 | 90.65 | 0.813 |
| Urdu-RoBERTa-small | 90.5 | 92.08 | 85.36 | 67.06 | 84.72 | 0.781 |
| HPLT-BERT-ur | **97.3** | **95.71** | 93.51 | **71.11** | 89.69 | 0.597 |
| mBERT | 75.5 | 90.88 | 83.03 | 65.78 | 85.47 | 0.744 |
| mmBERT-small | 89.5 | 92.36 | 73.09 | 70.36 | 85.44 | 0.494 |
| mmBERT-base | 92.4 | 93.97 | 77.77 | 67.75 | 87.31 | 0.495 |
| XLM-R-base | 89.6 | 93.72 | 85.22 | 60.56 | 88.69 | 0.754 |
| XLM-R-large | 94.1 | 94.38 | 83.55 | 69.62 | **91.15** | 0.492 |

### Efficiency

We report a normalized efficiency metric that combines Macro-F1 and inference throughput. All three DunbaaBERT variants score higher on average normalized efficiency (0.795 to 0.859) than every baseline in the table, the best of which is Urdu-RoBERTa-small at 0.781.

DunbaaBERT-52k achieved the strongest linguistic probing performance within the family on UrBLiMP (97.0), just behind HPLT-BERT-ur (97.3), while DunbaaBERT-32k provided the strongest overall efficiency profile (0.859). DunbaaBERT-96k ranked second in average efficiency (0.813) despite having the largest vocabulary.

## Fairseq Checkpoint

The fairseq checkpoint is available [here](https://drive.proton.me/urls/CBZ3JEK138#ACWZfQGdVev0).

## Citation

```bibtex
@misc{maab2026dunbaabertsacrificesemantics,
  title={DunbaaBERT: From Sacrifice to Semantics},
  author={Iffat Maab and Waleed Jamil and Raphael Schmitt},
  year={2026},
  eprint={2605.26935},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.26935},
}
```

## License

MIT License
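
## Appendix: Checking the Parameter Counts

The parameter counts in the Model Variants table can be cross-checked with a little arithmetic. The sketch below is not taken from the training code. It assumes the standard RoBERTa-base dimensions (12 layers, hidden size 768, feed-forward size 3072, 514 position embeddings, 1 token-type embedding) and counts the encoder plus the pooler, without the masked-language-modeling head. The function name `roberta_encoder_params` is hypothetical. Under these assumptions the formula reproduces all three reported counts, which suggests, but does not confirm, that this is how the numbers were obtained.

```python
def roberta_encoder_params(vocab_size, hidden=768, layers=12, ffn=3072,
                           max_positions=514, type_vocab=1):
    """Count parameters of a RoBERTa-style encoder with pooler, no MLM head.

    Default dimensions are the usual RoBERTa-base values (an assumption,
    not something stated in this model card).
    """
    # Word, position, and token-type embeddings, plus one LayerNorm
    # (weight and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward block: up-projection and down-projection with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)

    # Two LayerNorms per layer, each with weight and bias.
    layer_norms = 2 * (2 * hidden)

    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the first token's hidden state.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


# Vocabulary sizes and parameter counts as reported in the table.
reported = {
    32009: 110_625_024,
    52009: 125_985_024,
    96009: 159_777_024,
}

for vocab_size, expected in reported.items():
    computed = roberta_encoder_params(vocab_size)
    print(f"vocab={vocab_size:>6}  computed={computed:,}  matches={computed == expected}")
```

Under these assumptions only the word-embedding matrix depends on the vocabulary, so each extra vocabulary entry costs 768 parameters. Moving from 32,009 to 52,009 tokens therefore adds 20,000 x 768 = 15,360,000 parameters, which is exactly the gap between the first two rows of the table.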