AraBERT-MLM (DAPT on AHQAD/AHD)

This repository releases an AraBERT checkpoint that was domain-adapted by continued masked language modeling pretraining (DAPT-MLM) on the AHQAD/AHD Arabic health question–answer corpus.
The model is intended for constrained clinical question reformulation via mask filling: a placeholder in the question is replaced with one or more [MASK] tokens, and only the masked positions are predicted.
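The placeholder-to-mask step described above can be sketched in plain Python. This is a minimal illustration, not the authors' preprocessing code: the helper names `expand_placeholder` and `candidate_inputs`, the `___` placeholder, and the `max_masks` limit are assumptions made for the example. In practice the mask string should be taken from `tokenizer.mask_token` rather than hard-coded.

```python
from typing import List


def expand_placeholder(text: str, n_masks: int,
                       placeholder: str = "___",
                       mask_token: str = "[MASK]") -> str:
    """Replace the first placeholder with n_masks space-separated mask tokens."""
    if n_masks < 1:
        raise ValueError("n_masks must be at least 1")
    if placeholder not in text:
        raise ValueError("placeholder not found in text")
    span = " ".join([mask_token] * n_masks)
    # Only the first occurrence is replaced; the rest of the question is untouched.
    return text.replace(placeholder, span, 1)


def candidate_inputs(text: str, max_masks: int = 3,
                     placeholder: str = "___",
                     mask_token: str = "[MASK]") -> List[str]:
    """Build one masked variant per span length, from 1 to max_masks tokens."""
    return [expand_placeholder(text, n, placeholder, mask_token)
            for n in range(1, max_masks + 1)]


# Hypothetical usage with an Arabic question ("I have had pain in ___ for a week?")
for variant in candidate_inputs("عندي ألم في ___ منذ أسبوع؟", max_masks=2):
    print(variant)
```

Generating several span lengths is one possible way to cover completions that need more than one subword token. How the variants are scored or selected is left open here.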

Model ID

USERNAME/REPO_NAME

Training data

  • AHQAD/AHD Arabic health QA corpus (≈808k Q–A pairs, ≈90 specialties).
  • Used under the original terms of the dataset.

Intended use

  • Arabic clinical question rewriting/reformulation using span completion (mask filling).
  • A front-end module for Arabic clinical QA pipelines (retrieval/generation) to improve question clarity and completeness.

How to use (Transformers)

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder model ID; replace with the actual repository name.
repo_id = "USERNAME/REPO_NAME"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)

# Example question with a placeholder: "I have had pain in ___ for a week?"
text = "عندي ألم في ___ منذ أسبوع؟"
masked = text.replace("___", tokenizer.mask_token)

inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the first [MASK] token in the input sequence.
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0].item()
# Highest-scoring vocabulary ID at the masked position.
pred_id = logits[0, mask_index].argmax(-1).item()
print("Prediction:", tokenizer.decode([pred_id]))
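The snippet above reads only the first masked position and keeps a single prediction. When a placeholder is expanded to several [MASK] tokens, each position can be ranked separately. The following is a minimal sketch on plain Python lists: the helper names `find_mask_positions` and `top_k_token_ids` are hypothetical, and the IDs and scores in the demo are made up for illustration. Positions are ranked independently here, so the top candidates are not guaranteed to combine into a fluent multi-token span.

```python
from typing import List, Sequence, Tuple


def find_mask_positions(input_ids: Sequence[int], mask_token_id: int) -> List[int]:
    """Return the index of every mask token in the input ID sequence."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_token_id]


def top_k_token_ids(scores: Sequence[float], k: int = 5) -> List[Tuple[int, float]]:
    """Return the k highest-scoring (token_id, score) pairs for one position."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy demo with made-up values: 4 is the assumed mask ID, vocabulary size is 4.
toy_ids = [2, 10, 4, 4, 11, 3]
toy_scores = {2: [0.1, 2.5, -1.0, 0.7], 3: [1.2, 0.3, 3.1, -0.4]}
for pos in find_mask_positions(toy_ids, mask_token_id=4):
    print(pos, top_k_token_ids(toy_scores[pos], k=2))
```

With the tensors from the snippet above, `inputs["input_ids"][0].tolist()` gives the ID sequence and `logits[0, pos].tolist()` gives the scores for one masked position. `tokenizer.convert_ids_to_tokens` maps the returned IDs back to vocabulary tokens.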