---
license: cc-by-4.0
language:
- fa
library_name: coreml
pipeline_tag: token-classification
tags:
- persian
- farsi
- pii
- privacy
- token-classification
- coreml
- int4
- openmed-persian
base_model: PartAI/TookaBERT-Large
---

# OpenMed Persian PII TookaBERT Large CoreML INT4

4-bit palettized CoreML export of a Persian PII token-classification model trained on the cleaned OpenMed Persian PII corpus.

## Status

This repo contains a **verified CoreML INT4** artifact for Apple deployment.

Important runtime contract:

- `model.4bit-palettized.mlpackage` is fixed-shape batch `1`, sequence length `256`.
- Inputs are `int32`: `input_ids`, `attention_mask`, and `token_type_ids`.
- The verified CoreML graph uses an **all-ones attention mask**. Fill `attention_mask` with `1` for every position, including padded positions, then ignore special/pad offsets during span construction.
- Use sliding windows for long text and merge by original character offsets.

## Metrics

Original dense held-out test F1: **0.9800**

CoreML contract quality check: PyTorch evaluation with the same all-ones attention behavior on the first 2,000 held-out `test` rows:

```json
{
  "model_dir": "models/final_runs/dense-lowlr-combined-clean-20260530T222847-0700/PartAI__TookaBERT-Large",
  "dataset": "data/final_splits_audited/combined_clean",
  "split": "test",
  "rows": 2000,
  "max_length": 256,
  "batch_size": 32,
  "attention_mode": "all_ones",
  "device": "cuda",
  "precision": 0.9764901296875999,
  "recall": 0.978365230749536,
  "f1": 0.9774267809182761,
  "accuracy": 0.9945124547030911
}
```

CoreML parity verification:

```json
{
  "attention_mode": "all_ones",
  "batch_size": 1,
  "max_length": 256,
  "fp32_argmax_match_rate": 1.0,
  "int4_argmax_match_rate": 0.9921875,
  "int4_max_abs_diff_vs_torch": 7.133697509765625
}
```

## Recommended Inference Wrapper

Production use should wrap the CoreML model with:

1. Sliding windows at `max_length=256`, with overlap/stride around `96`.
2. Token-offset span reconstruction; ignore special tokens and zero-length offsets.
3. Whitespace trimming and overlap de-duplication.
4. High-precision regex/rule assists for emails, Iranian mobile/phone numbers, national IDs, postal codes, dates, card numbers, and IMEI exclusion.
5. Cue-word correction for labels near `کد ملی`, `گواهینامه`, `گذرنامه`, `کدپستی`, `شماره تماس`, and `ایمیل`.

See `inference_coreml.py` and `CoreMLWrapperContract.swift` for minimal wrapper contracts.

## Known Edge Cases

- Do not use a standard padding attention mask with this CoreML export. The verified path uses all-ones attention because CoreMLTools' converted transformer mask path did not preserve PyTorch parity.
- Long documents require sliding-window inference.
- Obfuscated contacts and verbal/spaced phone numbers need deterministic normalization/rules.
- IMEI-like device IDs can look like card numbers; validate card numbers before masking as `CREDITCARDNUMBER`.

Best Persian-script dense model, but this CoreML contract drops slightly versus ONNX; trim leading whitespace spans in postprocessing.

## Files

- `model.4bit-palettized.mlpackage`: verified 4-bit CoreML model.
- `verification.json`: fixture-level CoreML parity verification.
- `coreml_allones_hf_eval_test_2000.json`: quality check for the CoreML attention contract.
- tokenizer/config files from the trained checkpoint.
- `reports/`: ad hoc edgecase reports from the dense model.