4-bit palettized CoreML export of a Persian PII token-classification model trained on the cleaned OpenMed Persian PII corpus.
This repo contains a verified CoreML INT4 artifact for Apple deployment.
Important runtime contract:
- model.4bit-palettized.mlpackage is fixed-shape: batch 1, sequence length 256.
- Inputs are int32: input_ids, attention_mask, and token_type_ids.
- Fill attention_mask with 1 for every position, including padded positions, then ignore special/pad offsets during span construction.

Original dense held-out test F1: 0.9800
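The runtime contract above can be sketched in Python with NumPy. This is a minimal illustration, not the contents of inference_coreml.py: the helper name `build_fixed_inputs`, the pad id of 0, and the all-zero token_type_ids are assumptions, and the real pad id should come from the tokenizer.

```python
import numpy as np

MAX_LENGTH = 256  # fixed sequence length from the runtime contract


def build_fixed_inputs(token_ids, pad_id=0):
    """Pad a token-id list to the fixed (1, 256) int32 shape.

    pad_id=0 is an assumption; read the real value from the tokenizer.
    Returns the input dict and the number of real (non-pad) tokens.
    """
    if len(token_ids) > MAX_LENGTH:
        raise ValueError("split long inputs into windows before calling")

    input_ids = np.full((1, MAX_LENGTH), pad_id, dtype=np.int32)
    input_ids[0, : len(token_ids)] = token_ids

    # Contract: the mask is 1 everywhere, including padded positions.
    attention_mask = np.ones((1, MAX_LENGTH), dtype=np.int32)

    # Single-segment input, so every segment id is 0 (assumption).
    token_type_ids = np.zeros((1, MAX_LENGTH), dtype=np.int32)

    inputs = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }
    return inputs, len(token_ids)
```

Because the mask no longer marks padding, the returned token count is what lets span construction skip padded positions afterwards.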
CoreML contract quality check: PyTorch evaluation with the same all-ones attention behavior on the first 2,000 held-out test rows:
{
"model_dir": "models/final_runs/dense-lowlr-combined-clean-20260530T222847-0700/PartAI__TookaBERT-Large",
"dataset": "data/final_splits_audited/combined_clean",
"split": "test",
"rows": 2000,
"max_length": 256,
"batch_size": 32,
"attention_mode": "all_ones",
"device": "cuda",
"precision": 0.9764901296875999,
"recall": 0.978365230749536,
"f1": 0.9774267809182761,
"accuracy": 0.9945124547030911
}
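As a quick sanity check on the report above, the F1 value should be the harmonic mean of the reported precision and recall. The sketch below recomputes it and also shows the gap to the original dense F1 of 0.9800. That gap is indicative only, since this run covers the first 2,000 test rows with all-ones attention and the source does not say the dense figure used the same subset.

```python
# Values copied from the evaluation report above.
precision = 0.9764901296875999
recall = 0.978365230749536
reported_f1 = 0.9774267809182761

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Gap between the dense held-out F1 and the all-ones-attention F1.
gap_vs_dense = 0.9800 - reported_f1

print(f"recomputed F1: {f1:.6f}")
print(f"gap vs dense:  {gap_vs_dense:.4f}")
```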
CoreML parity verification:
{
"attention_mode": "all_ones",
"batch_size": 1,
"max_length": 256,
"fp32_argmax_match_rate": 1.0,
"int4_argmax_match_rate": 0.9921875,
"int4_max_abs_diff_vs_torch": 7.133697509765625
}
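The two kinds of parity figure above can be illustrated with a small NumPy sketch. The function name `parity_metrics` and the assumption that the rates are per-position argmax agreement over `(seq_len, num_labels)` logits are mine, not taken from verification.json.

```python
import numpy as np


def parity_metrics(reference_logits, candidate_logits):
    """Compare two logits arrays of shape (seq_len, num_labels)."""
    ref = np.asarray(reference_logits, dtype=np.float64)
    cand = np.asarray(candidate_logits, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError("logits arrays must have the same shape")

    # Fraction of positions where both models predict the same label.
    match = ref.argmax(axis=-1) == cand.argmax(axis=-1)

    return {
        "argmax_match_rate": float(match.mean()),
        "max_abs_diff": float(np.abs(ref - cand).max()),
    }


# Synthetic example: 256 positions, 3 labels, two positions flipped.
ref = np.zeros((256, 3))
ref[:, 0] = 5.0
cand = ref.copy()
cand[:2, 1] = 9.0
metrics = parity_metrics(ref, cand)
```

The reported INT4 rate of 0.9921875 equals 254/256, which would be consistent with two differing positions in a single 256-token window, though the source does not state how many fixtures were averaged. A large maximum logit difference can coexist with a high match rate, because argmax only changes when the top two labels swap.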
Production use should wrap the CoreML model with:
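- Windowed inference at max_length=256, with overlap/stride around 96.
- Handling for کد ملی (national ID code), گواهینامه (driver's license), گذرنامه (passport), کدپستی (postal code), شماره تماس (contact number), and ایمیل (email).

See inference_coreml.py and CoreMLWrapperContract.swift for minimal wrapper contracts.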
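One way to cover inputs longer than the fixed window is to slide overlapping windows across the token sequence. The sketch below is an illustration under stated assumptions, not the logic in inference_coreml.py: `window_starts` is a hypothetical helper, `window=254` assumes two of the 256 positions are reserved for special tokens, and `overlap=96` reads "overlap/stride around 96" as the number of tokens shared by neighboring windows.

```python
def window_starts(n_tokens, window=254, overlap=96):
    """Return start offsets of overlapping windows that cover n_tokens.

    window=254 and overlap=96 are assumptions described in the lead-in.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")

    step = window - overlap  # how far each new window advances
    starts = [0]
    # Add windows until the last one reaches the end of the sequence.
    while starts[-1] + window < n_tokens:
        starts.append(starts[-1] + step)
    return starts


print(window_starts(600))  # [0, 158, 316, 474]
```

Predictions in the overlapping regions then need a merge rule, for example preferring the window in which the token sits farther from an edge.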
CREDITCARDNUMBER.
Best Persian-script dense model, but this CoreML contract drops slightly versus ONNX; trim leading whitespace spans in postprocessing.
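The whitespace trimming mentioned above could look like the following sketch. The span format `(start, end, label)` with character offsets and an exclusive end is an assumption, and both the helper name `trim_leading_whitespace` and the `PHONE` label are hypothetical.

```python
def trim_leading_whitespace(text, spans):
    """Move span starts past leading whitespace; drop spans left empty.

    spans is a list of (start, end, label) character offsets, end exclusive.
    """
    trimmed = []
    for start, end, label in spans:
        # Advance the start while it points at a whitespace character.
        while start < end and text[start].isspace():
            start += 1
        if start < end:
            trimmed.append((start, end, label))
    return trimmed


text = "call  09121234567 now"
spans = [(4, 17, "PHONE"), (4, 6, "PHONE")]
cleaned = trim_leading_whitespace(text, spans)
print(cleaned)  # [(6, 17, 'PHONE')]
```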
- model.4bit-palettized.mlpackage: verified 4-bit CoreML model.
- verification.json: fixture-level CoreML parity verification.
- coreml_allones_hf_eval_test_2000.json: quality check for the CoreML attention contract.
- reports/: ad hoc edge-case reports from the dense model.

Base model
PartAI/TookaBERT-Large