--- license: cc-by-4.0 language: - fa library_name: coreml pipeline_tag: token-classification tags: - persian - farsi - pii - privacy - token-classification - coreml - int4 - openmed-persian base_model: PartAI/TookaBERT-Large --- # OpenMed Persian PII TookaBERT Large CoreML INT4 4-bit palettized CoreML export of a Persian PII token-classification model trained on the cleaned OpenMed Persian PII corpus. ## Status This repo contains a **verified CoreML INT4** artifact for Apple deployment. Important runtime contract: - `model.4bit-palettized.mlpackage` is fixed-shape batch `1`, sequence length `256`. - Inputs are `int32`: `input_ids`, `attention_mask`, and `token_type_ids`. - The verified CoreML graph uses an **all-ones attention mask**. Fill `attention_mask` with `1` for every position, including padded positions, then ignore special/pad offsets during span construction. - Use sliding windows for long text and merge by original character offsets. ## Metrics Original dense held-out test F1: **0.9800** CoreML contract quality check: PyTorch evaluation with the same all-ones attention behavior on the first 2,000 held-out `test` rows: ```json { "model_dir": "models/final_runs/dense-lowlr-combined-clean-20260530T222847-0700/PartAI__TookaBERT-Large", "dataset": "data/final_splits_audited/combined_clean", "split": "test", "rows": 2000, "max_length": 256, "batch_size": 32, "attention_mode": "all_ones", "device": "cuda", "precision": 0.9764901296875999, "recall": 0.978365230749536, "f1": 0.9774267809182761, "accuracy": 0.9945124547030911 } ``` CoreML parity verification: ```json { "attention_mode": "all_ones", "batch_size": 1, "max_length": 256, "fp32_argmax_match_rate": 1.0, "int4_argmax_match_rate": 0.9921875, "int4_max_abs_diff_vs_torch": 7.133697509765625 } ``` ## Recommended Inference Wrapper Production use should wrap the CoreML model with: 1. Sliding windows at `max_length=256`, with overlap/stride around `96`. 2. Token-offset span reconstruction; ignore special tokens and zero-length offsets. 3. Whitespace trimming and overlap de-duplication. 4. High-precision regex/rule assists for emails, Iranian mobile/phone numbers, national IDs, postal codes, dates, card numbers, and IMEI exclusion. 5. Cue-word correction for labels near `کد ملی`, `گواهینامه`, `گذرنامه`, `کدپستی`, `شماره تماس`, and `ایمیل`. See `inference_coreml.py` and `CoreMLWrapperContract.swift` for minimal wrapper contracts. ## Known Edge Cases - Do not use a standard padding attention mask with this CoreML export. The verified path uses all-ones attention because CoreMLTools' converted transformer mask path did not preserve PyTorch parity. - Long documents require sliding-window inference. - Obfuscated contacts and verbal/spaced phone numbers need deterministic normalization/rules. - IMEI-like device IDs can look like card numbers; validate card numbers before masking as `CREDITCARDNUMBER`. Best Persian-script dense model, but this CoreML contract drops slightly versus ONNX; trim leading whitespace spans in postprocessing. ## Files - `model.4bit-palettized.mlpackage`: verified 4-bit CoreML model. - `verification.json`: fixture-level CoreML parity verification. - `coreml_allones_hf_eval_test_2000.json`: quality check for the CoreML attention contract. - tokenizer/config files from the trained checkpoint. - `reports/`: ad hoc edgecase reports from the dense model.