---
library_name: transformers
license: mit
language:
- en
base_model: FacebookAI/roberta-base
pipeline_tag: token-classification
tags:
- token-classification
- ner
- insurance
- billing
- healthcare
- roberta
- clarioscope
metrics:
- f1
model-index:
- name: clarioscope-insurance-v1
  results:
  - task:
      type: token-classification
      name: Insurance / billing field extraction
    dataset:
      type: synthetic
      name: clarioscope-insurance-suite
    metrics:
    - type: macro-f1
      value: 0.7882
    - type: weighted-f1
      value: 0.8202
    - type: latency_ms_per_example
      value: 45.4
---

# ClarioScope insurance extractor v1

A 125M-parameter [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) fine-tune that extracts **structured insurance and billing fields** from inbound patient text. It tags spans across 12 field types — carrier, plan type, member ID, group number, claim ID, copay, deductible, billed amount, and more — and produces output that downstream billing systems can ingest as JSON. This is **model 3 of the [ClarioScope SLM Suite](https://huggingface.co/raihan-js)** — a three-model intake intelligence pipeline for healthcare practices.

## TL;DR

| Property | Value |
|---|---|
| Task | Token classification over 12 insurance / billing field types (BIO tags, 25 labels) |
| Base model | [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) (125M params) |
| Training data | 8,086 synthetic training examples after cleanup, plus 897 validation examples, generated via `gpt-4o-mini-2024-07-18` |
| Test data | 672 synthetic examples generated by Claude Haiku 4.5 with a deliberately different prompt style |
| Val macro F1 | 95.71% |
| Val weighted F1 | 97.06% |
| **Test macro F1 vs GPT-4o** | **78.82%** vs gpt-4o-2024-11-20 95.62% |
| **Where fine-tune is closest to GPT-4o** | `SUBSCRIBER_NAME` (89 vs 91), `CLAIM_ID` (95 vs 100), `MEMBER_ID` (91 vs 99), `CARRIER` (90 vs 96) |
| **Where it loses badly** | `AUTH_NUMBER` (30 vs 99), `PLAN_TYPE` (69 vs 96) — low-frequency fields with high format variance |
| **Test latency** | **45.4 ms/example on CPU** — 26× faster than GPT-4o |
| **Per-inference cost** | **$0** self-hosted vs $1.90 per 1K for GPT-4o |
| License | MIT |

> **Note on benchmark scope.** This release benchmarks against GPT-4o only. Anthropic API credit was exhausted during the development cycle before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could be run. A subsequent revision of this card will add the Anthropic numbers.

![Macro F1 vs latency](https://huggingface.co/raihan-js/clarioscope-insurance-v1/resolve/main/f1_vs_latency.png)

## Why this model exists

Patient inquiries about insurance, billing, and prior authorizations have a different shape from the rest of intake. The information value is in the structured fields — carrier name, member ID, group number, claim reference, copay amount — not in the surrounding prose. A practice's billing system can act on `{"carrier": "Aetna", "plan_type": "PPO", "member_id": "AET-998-2210"}` directly; it cannot act on "hi can you check if my Aetna PPO covers..." until those fields are extracted.

Frontier LLMs do this extraction well, with three consistent caveats: they cost real money per call, add ~1 second of latency, and send patient text to a third party. This model is the self-hosted alternative: 26× lower latency, $0 per inference after training, and patient text never leaves the host. It trails GPT-4o by 17 macro F1 points on aggregate, but it stays within 1–8 F1 points of GPT-4o on four core fields (`MEMBER_ID`, `CARRIER`, `CLAIM_ID`, `SUBSCRIBER_NAME`), which together account for about 48% of test entities.

This is model 3 of the ClarioScope SLM Suite:

1. **Intent classifier** ([`clarioscope-intent-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1)) — what does this message want?
2. **PHI detector** ([`clarioscope-phi-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-phi-deberta-v1)) — where is protected information in this message?
3. **Insurance extractor** (this model) — what billing-relevant structured data is in this message?

## The 12 field types

| Label | What it captures |
|---|---|
| `CARRIER` | Insurance company name (Aetna, Blue Cross Blue Shield, UnitedHealthcare, Cigna, Medicare, Medicaid, Kaiser Permanente, Humana, Anthem) |
| `PLAN_TYPE` | Plan format or tier (PPO, HMO, EPO, HDHP, POS, Gold / Silver / Bronze / Platinum) |
| `MEMBER_ID` | The patient's subscriber / member ID on the insurance card |
| `GROUP_NUMBER` | Group / employer group number |
| `POLICY_NUMBER` | Distinct policy or contract number when separate from member ID |
| `SUBSCRIBER_NAME` | Name of the primary subscriber on the policy |
| `RELATIONSHIP` | Relationship to subscriber (self, spouse, child, dependent, parent) |
| `CLAIM_ID` | Claim reference number |
| `AUTH_NUMBER` | Prior authorization / pre-cert number |
| `COPAY` | Copay dollar amount (includes the `$`) |
| `DEDUCTIBLE` | Deductible dollar amount (includes the `$`) |
| `BILLED_AMOUNT` | Total amount billed, owed, or charged (includes the `$`) |

The model outputs BIO token labels (25 total: `O` + 12 × `{B-, I-}`), which downstream code converts back into character-offset spans and then into a JSON object.
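As a minimal sketch, the 25-label set can be derived from the 12 field types as shown below. The label ordering here is an assumption for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
FIELD_TYPES = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]

# "O" (outside any field) plus a B- (begin) and I- (inside) tag per field type.
# NOTE: this ordering is illustrative; read the real one from model.config.id2label.
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELD_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))  # 25
```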

## Architecture

Standard RoBERTa-base encoder with a token-classification head: a linear layer over each token's contextualized representation, producing 25 logits per token. All 125M parameters are fine-tuned. Training setup:

- Precision: fp32 (not mixed precision; see the PHI detector's card for the NaN-gradient story that motivated this)
- Optimizer: `adamw_torch`, weight decay 0.01, `max_grad_norm=1.0`
- Classifier head: explicit re-init (`std=0.02` normal, zero bias)
- Batch size 8, sequence length 256
- Learning rate 1e-5 with cosine schedule and 10% warmup
- Early stopping with patience 2 on macro F1

Five epochs run in ~10 minutes on a single RTX A4000.
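To make the learning-rate schedule concrete, here is a small standalone sketch of linear warmup followed by cosine decay. It is not the training script: `lr_at` is a hypothetical helper, and the step count assumes 8,086 training examples, batch size 8, five full epochs, and no gradient accumulation.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-5, warmup_frac=0.10):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Assumed step count: 8,086 examples / batch size 8, 5 epochs, no gradient accumulation.
steps_per_epoch = math.ceil(8086 / 8)
total_steps = 5 * steps_per_epoch

for step in (0, total_steps // 10, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

If early stopping ends training before epoch five, the schedule is simply cut off partway through the decay.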

## Training data — synthetic, transparent about it

All training and evaluation data is **synthetic**. There is **no real patient billing data** in this model or its evaluation.

**Training set (8,086 examples after cleanup, from 8,120 originally).** Generated via the OpenAI API (`gpt-4o-mini-2024-07-18`, JSON-object response format, temperature 1.0) across 12 field types × healthcare practice types × channels. The generation prompt enforces a 40/40/20 realism mix (polished / casual / messy) and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).

**Data cleanup.** Same cue-word noise pattern observed in the PHI detector: `gpt-4o-mini` sometimes returns entity texts that include the cue word (`"member ID AET-998-2210"` instead of `"AET-998-2210"`, `"copay $35"` instead of `"$35"`). `clean_data.py` in the repo strips these cue-word prefixes and re-locates the cleaned text in the source. 1,393 entities (~7.4% of all spans) were normalized during cleanup; 256 entities became empty after stripping and were dropped.
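The cleanup step can be pictured roughly as follows. This is an illustrative sketch only, not the contents of `clean_data.py`: the cue-word list and the `clean_entity` helper are hypothetical, and re-locating via the first occurrence in the source is a simplification.

```python
import re

# Hypothetical cue-word list for illustration; the repo's clean_data.py may differ.
CUE_PREFIX = re.compile(
    r"^(?:member\s+id|group\s+(?:number|#)|claim\s+(?:id|number)|copay|deductible"
    r"|auth(?:orization)?\s+(?:number|#))[\s:#-]*",
    re.IGNORECASE,
)

def clean_entity(source, entity_text):
    """Strip a leading cue word, then re-locate the cleaned text in the source.

    Returns (cleaned_text, start, end), or None when nothing is left after
    stripping or the cleaned text cannot be found in the source.
    """
    cleaned = CUE_PREFIX.sub("", entity_text).strip()
    if not cleaned:
        return None  # entity became empty after stripping: drop it
    start = source.find(cleaned)  # simplification: first occurrence only
    if start == -1:
        return None
    return cleaned, start, start + len(cleaned)

source = "My member ID AET-998-2210 is on file and my copay $35 was charged."
print(clean_entity(source, "member ID AET-998-2210"))
print(clean_entity(source, "copay $35"))
print(clean_entity(source, "copay"))
```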

**Test set (672 examples).** Generated by Claude Haiku 4.5 with a deliberately different prompt style and abbreviation set (`w/`, `&`, `ins`, `mbr`, `grp` versus the train prompt's more formal style). This cross-generator split mitigates benchmark leakage from training the model on one generator's style.

**Anti-leakage validation.** The cross-generator split produces a large val/test gap (val macro F1 0.96; test macro F1 0.79). The val set comes from the same `gpt-4o-mini` distribution as training; the test set comes from a different generator. The 17-point gap is the cost of generalizing across generators — and a useful proxy for the gap that would exist between this model and a real-world distribution. Val numbers from same-generator splits routinely overstate real-world generalization on tasks like this.

## Benchmark — entity-level F1, span boundaries matter

Evaluated on the 672-example held-out test set using [seqeval](https://github.com/chakki-works/seqeval), which requires both the entity type AND the exact span boundary to match for a true positive.
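To show what strict matching means in practice, the sketch below scores a single tagged sequence in plain Python. It is a simplified illustration, not seqeval's code: `bio_to_spans` and `strict_f1` are hypothetical helpers, and stray `I-` tags without a preceding `B-` are ignored here, which may differ from how seqeval treats malformed sequences.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans (end exclusive)."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not inside:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans

def strict_f1(gold_tags, pred_tags):
    """F1 where a prediction counts only if type AND both boundaries match exactly."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "I-MEMBER_ID", "O", "B-COPAY"]
pred = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "O", "O", "B-COPAY"]
print(strict_f1(gold, pred))  # 0.5
```

The predicted `MEMBER_ID` span is one token short, so it counts as both a false positive and a false negative even though the type is correct.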

| Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| **`raihan-js/clarioscope-insurance-v1` (CPU)** | **0.7882** | **0.8202** | **45.4 ms** | **$0.00** |
| `gpt-4o-2024-11-20` | 0.9562 | 0.9572 | 1202.3 ms | $1.90 |

![F1 and cost comparison](https://huggingface.co/raihan-js/clarioscope-insurance-v1/resolve/main/f1_vs_cost.png)

GPT-4o wins on aggregate, but the per-entity breakdown shows the gap is concentrated in a few low-frequency fields.

## Per-field breakdown

![Per-field F1: fine-tuned vs GPT-4o](https://huggingface.co/raihan-js/clarioscope-insurance-v1/resolve/main/per_entity_f1.png)

**High-volume fields where fine-tune is competitive:**

| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| `CLAIM_ID` | 0.954 | 0.997 | 146 |
| `MEMBER_ID` | 0.914 | 0.990 | 431 |
| `CARRIER` | 0.905 | 0.964 | 685 |
| `SUBSCRIBER_NAME` | 0.894 | 0.906 | 196 |
| `COPAY` | 0.864 | 0.954 | 179 |
| `BILLED_AMOUNT` | 0.847 | 0.971 | 219 |

`SUBSCRIBER_NAME` is essentially tied with GPT-4o (0.89 vs 0.91). Four of these fields (`CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `BILLED_AMOUNT`) collectively cover ~49% of the 3,040 test entities, and the fine-tune is within 4–13 points of GPT-4o on each.

**Mid-volume fields with moderate gaps:**

| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| `POLICY_NUMBER` | 0.827 | 0.985 | 97 |
| `GROUP_NUMBER` | 0.804 | 0.995 | 216 |
| `DEDUCTIBLE` | 0.758 | 0.964 | 190 |

`GROUP_NUMBER` has a 19-point gap despite a test support of 216 spans. The format variance for group numbers is wide (`4421`, `001428`, `GRP-882044`, `Group #44210`) and the cross-generator split exposes the train-set format bias.

**Lowest-F1 fields:**

| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| `RELATIONSHIP` | 0.703 | 0.802 | 207 |
| `PLAN_TYPE` | 0.688 | 0.956 | 359 |
| `AUTH_NUMBER` | 0.300 | 0.991 | 115 |

`AUTH_NUMBER` is the headline weakness — 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans total, and the format space is wide (`PA-4421`, `auth #998-2210`, `AUTH998212`, etc.). A v2 with more AUTH_NUMBER coverage in training would likely close most of this gap.

`RELATIONSHIP` is a hard category for both models — short string ("self", "spouse", "child"), often overlapping with other entity contexts, with a tight span boundary that's easy to miss by one token.
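As a consistency check, the headline macro and weighted F1 for the fine-tune can be recomputed from the per-field tables above. The snippet only re-uses the published per-field scores and test support counts; the variable names are illustrative.

```python
# (fine-tune F1, test support) per field, copied from the tables above.
PER_FIELD = {
    "CLAIM_ID": (0.954, 146), "MEMBER_ID": (0.914, 431), "CARRIER": (0.905, 685),
    "SUBSCRIBER_NAME": (0.894, 196), "COPAY": (0.864, 179), "BILLED_AMOUNT": (0.847, 219),
    "POLICY_NUMBER": (0.827, 97), "GROUP_NUMBER": (0.804, 216), "DEDUCTIBLE": (0.758, 190),
    "RELATIONSHIP": (0.703, 207), "PLAN_TYPE": (0.688, 359), "AUTH_NUMBER": (0.300, 115),
}

total_support = sum(support for _, support in PER_FIELD.values())

# Macro F1: unweighted mean over the 12 fields.
macro_f1 = sum(f1 for f1, _ in PER_FIELD.values()) / len(PER_FIELD)

# Weighted F1: mean weighted by each field's test support.
weighted_f1 = sum(f1 * support for f1, support in PER_FIELD.values()) / total_support

print(total_support, f"{macro_f1:.4f}", f"{weighted_f1:.4f}")
```

Both values agree with the benchmark table (0.7882 macro, 0.8202 weighted) up to rounding of the per-field scores.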

## Recommended production pattern: hybrid

For a billing pipeline that processes inbound patient messages:

1. **Run this model first** on every message. Captures `CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `SUBSCRIBER_NAME` with near-frontier F1 at ~45 ms / message on CPU.
2. **Add regex for high-value structured patterns**: dollar amounts (`\$[\d,]+(?:\.\d{2})?`) and member ID format checks specific to your top carriers (Aetna and BCBS IDs, for example, each follow distinct patterns).
3. **Use GPT-4o as a fallback** for messages where the fine-tune is uncertain or for AUTH_NUMBER / PLAN_TYPE detection. The fallback should fire on ~10–20% of messages, not 100%.
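The three steps above could be wired together roughly as sketched below. Everything beyond the dollar-amount regex is an assumption made for illustration: the cue-word list, the `0.80` threshold, the helper names, and the per-span `score` field (the usage example in this card does not compute one).

```python
import re

# Dollar-amount pattern quoted in step 2 above.
DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d{2})?")

# Hypothetical cue words suggesting the message involves one of the weak fields.
WEAK_FIELD_CUES = re.compile(
    r"\b(?:auth|authorization|pre-?cert|ppo|hmo|epo|hdhp)\b", re.IGNORECASE
)

def missed_amounts(text, spans):
    """Dollar amounts the regex finds that no model span already contains."""
    extracted = [s["text"] for s in spans]
    return [m for m in DOLLAR_RE.findall(text) if not any(m in e for e in extracted)]

def needs_fallback(text, spans, min_score=0.80):
    """Decide whether to route a message to the frontier-API fallback.

    Assumes each span carries a 'score' (for example the mean softmax
    probability of its tokens); spans without one are treated as confident.
    """
    low_confidence = any(s.get("score", 1.0) < min_score for s in spans)
    weak_field_likely = bool(WEAK_FIELD_CUES.search(text))
    return low_confidence or weak_field_likely

text = "My copay was $35 but I was billed $1,250.00"
spans = [{"label": "COPAY", "text": "$35", "score": 0.97}]
print(missed_amounts(text, spans))
print(needs_fallback(text, spans))
```

The threshold and cue list would need tuning on real traffic to keep the fallback rate in the intended range.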

This pattern is the same architecture recommended for the PHI detector. The fine-tune does the bulk-volume linguistic work; frontier APIs handle the long tail. At a 10–20% fallback rate, the API spend is roughly $0.19–$0.38 per 1K messages instead of $1.90, which is 5–10× less than running GPT-4o on every message.
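A back-of-the-envelope estimate of the hybrid pattern's cost and latency, using only the benchmark figures in this card. The sketch assumes the self-hosted model costs $0 per call, every message runs through the fine-tune first, and fallback calls are made sequentially afterwards; `hybrid_estimate` is a hypothetical helper.

```python
GPT4O_COST_PER_1K = 1.90   # USD per 1K inferences, from the benchmark table
FINETUNE_MS = 45.4         # CPU latency per example, fine-tune
GPT4O_MS = 1202.3          # latency per example, GPT-4o

def hybrid_estimate(fallback_rate):
    """Return (API cost in USD per 1K messages, mean latency in ms per message)."""
    cost_per_1k = fallback_rate * GPT4O_COST_PER_1K
    mean_latency_ms = FINETUNE_MS + fallback_rate * GPT4O_MS
    return cost_per_1k, mean_latency_ms

for rate in (0.10, 0.20, 1.00):
    cost, latency = hybrid_estimate(rate)
    print(f"fallback {rate:.0%}: ${cost:.2f} per 1K, {latency:.1f} ms mean latency")
```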

## How to use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-insurance-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = ("Hi, I'd like to verify my coverage before tomorrow's appointment. "
        "My carrier is Aetna PPO, member ID AET-998-2210, group #4421. "
        "I'm the subscriber. My copay should be $35 — is that right?")

enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

id2label = model.config.id2label
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    if label.startswith("B-"):
        ent_type = label[2:]
        start = offsets[i][0]
        end = offsets[i][1]
        j = i + 1
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

# Convert to structured JSON for the downstream billing system.
# Note: if a label appears more than once, the last span wins.
extracted = {span["label"]: span["text"] for span in spans}
print(extracted)
# Illustrative output (actual predictions may differ):
# {'CARRIER': 'Aetna', 'PLAN_TYPE': 'PPO', 'MEMBER_ID': 'AET-998-2210',
#  'GROUP_NUMBER': '4421', 'RELATIONSHIP': 'subscriber', 'COPAY': '$35'}
```
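A message can mention the same field more than once (two claim IDs, for instance), and a plain dict keyed by label keeps only one of them. A possible variant, sketched here with a hypothetical `spans_to_record` helper, collects every distinct value per label:

```python
from collections import defaultdict

def spans_to_record(spans):
    """Group span texts by label, keeping every distinct value in order of appearance."""
    grouped = defaultdict(list)
    for span in spans:
        value = span["text"].strip()
        if value and value not in grouped[span["label"]]:
            grouped[span["label"]].append(value)
    return dict(grouped)

example_spans = [
    {"label": "CARRIER", "text": "Aetna"},
    {"label": "CLAIM_ID", "text": "CLM-1001"},
    {"label": "CLAIM_ID", "text": "CLM-1002"},
    {"label": "CLAIM_ID", "text": "CLM-1001"},
]
print(spans_to_record(example_spans))
# {'CARRIER': ['Aetna'], 'CLAIM_ID': ['CLM-1001', 'CLM-1002']}
```

The example spans are made up for illustration. Whether a downstream billing system expects lists or single values per field is a design choice for the integration.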

## Limitations

- **All training and evaluation data is synthetic.** No real patient billing data was used. Production deployment should include calibration against real-world inbound messages.
- **`AUTH_NUMBER` is materially weaker than frontier** (0.30 vs 0.99). Do not rely on this model alone for prior-auth extraction.
- **Format brittleness on structured IDs.** When a member ID, group number, or claim reference uses a format outside the training distribution, the model often produces a span boundary off by one token, which is a miss under strict matching.
- **English only, healthcare practice domain only.**
- **`RELATIONSHIP` accuracy is moderate** (0.70). The category overlaps with other named entities and short-span detection is fragile in cross-generator evaluation.
- **Benchmark scope.** Only GPT-4o was benchmarked in this release; Anthropic API credit was exhausted before Haiku 4.5 / Sonnet 4.6 could be run.

## Intended use

A first-pass insurance / billing field extractor in a hybrid intake pipeline for healthcare practice software. Strongest on `CLAIM_ID`, `MEMBER_ID`, `CARRIER`, and `SUBSCRIBER_NAME` (test F1 0.89–0.95). Should be paired with regex matchers (for dollar amounts and known ID formats) and a frontier-API fallback (for `AUTH_NUMBER` and `PLAN_TYPE` edge cases).

## Out-of-scope use

- **Sole reliance for prior-auth processing.** The model's `AUTH_NUMBER` F1 is only 0.30 on test data, far too low to trust without review.
- **Direct downstream billing without validation.** Any extracted member ID, group number, or claim reference should be validated against the carrier's verification endpoint before being relied on for payment routing.
- **Adversarial inputs.** Not hardened against prompt injection or adversarial text crafted to manipulate extraction.

## Citation

```bibtex
@misc{sikder2026clarioscope_insurance,
  author = {Sikder, Akteruzzaman Raihan},
  title  = {ClarioScope insurance extractor v1: a 125M-parameter RoBERTa fine-tune for structured insurance and billing extraction},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-insurance-v1}},
}
```

## Read more

The full ClarioScope SLM Suite writeup — methodology, cost ledger, and per-model interpretation — is on dev.to. Each model has its own post; the suite-level summary is at the GitHub profile [github.com/raihan-js](https://github.com/raihan-js).

## Author

Built by [Akteruzzaman Raihan Sikder](https://huggingface.co/raihan-js) — CTO, [ClarioScope AI](https://clarioscope.ai). Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor) — a three-model intake intelligence pipeline.