ClarioScope insurance extractor v1

A 125M-parameter RoBERTa-base fine-tune that extracts structured insurance and billing fields from inbound patient text. It tags spans across 12 field types — carrier, plan type, member ID, group number, claim ID, copay, deductible, billed amount, and more — and produces output that downstream billing systems can ingest as JSON. This is model 3 of the ClarioScope SLM Suite — a three-model intake intelligence pipeline for healthcare practices.

TL;DR

Property	Value
Task	Token classification over 12 insurance / billing field types (BIO tags, 25 labels)
Base model	FacebookAI/roberta-base (125M params)
Training data	8,086 synthetic examples (8,086 train / 897 val after cleanup) generated via `gpt-4o-mini-2024-07-18`
Test data	672 synthetic examples generated by Claude Haiku 4.5 with a deliberately different prompt style
Val macro F1	95.71%
Val weighted F1	97.06%
Test macro F1 vs GPT-4o	78.82% vs gpt-4o-2024-11-20 95.62%
Where fine-tune is closest to GPT-4o	`SUBSCRIBER_NAME` (89 vs 91), `CLAIM_ID` (95 vs 100), `MEMBER_ID` (91 vs 99), `CARRIER` (90 vs 96)
Where it loses badly	`AUTH_NUMBER` (30 vs 99), `PLAN_TYPE` (69 vs 96) — low-frequency fields with high format variance
Test latency	45.4 ms/example on CPU — 26× faster than GPT-4o
Per-inference cost	$0 self-hosted vs $1.90 per 1K for GPT-4o
License	MIT

Note on benchmark scope. This release benchmarks against GPT-4o only. Anthropic API credit was exhausted during the development cycle before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could be run. A subsequent revision of this card will add the Anthropic numbers.

Why this model exists

Patient inquiries about insurance, billing, and prior authorizations have a different shape from the rest of intake. The information value is in the structured fields — carrier name, member ID, group number, claim reference, copay amount — not in the surrounding prose. A practice's billing system can act on {"carrier": "Aetna", "plan_type": "PPO", "member_id": "AET-998-2210"} directly; it cannot act on "hi can you check if my Aetna PPO covers..." until those fields are extracted.

Frontier LLMs do this extraction well, with three consistent caveats: they cost real money per call, add ~1 second of latency, and send patient text to a third party. This model is the self-hosted alternative: 26× lower latency, $0 per inference after training, and patient text never leaves the host. It trails GPT-4o by 17 macro F1 points on aggregate, but it stays within 1 to 8 F1 points of GPT-4o on four key fields (SUBSCRIBER_NAME, CLAIM_ID, CARRIER, MEMBER_ID).

This is model 3 of the ClarioScope SLM Suite:

1. Intent classifier (clarioscope-intent-deberta-v1): what does this message want?
2. PHI detector (clarioscope-phi-deberta-v1): where is protected information in this message?
3. Insurance extractor (this model): what billing-relevant structured data is in this message?

The 12 field types

Label	What it captures
`CARRIER`	Insurance company name (Aetna, Blue Cross Blue Shield, UnitedHealthcare, Cigna, Medicare, Medicaid, Kaiser Permanente, Humana, Anthem)
`PLAN_TYPE`	Plan format or tier (PPO, HMO, EPO, HDHP, POS, Gold / Silver / Bronze / Platinum)
`MEMBER_ID`	The patient's subscriber / member ID on the insurance card
`GROUP_NUMBER`	Group / employer group number
`POLICY_NUMBER`	Distinct policy or contract number when separate from member ID
`SUBSCRIBER_NAME`	Name of the primary subscriber on the policy
`RELATIONSHIP`	Relationship to subscriber (self, spouse, child, dependent, parent)
`CLAIM_ID`	Claim reference number
`AUTH_NUMBER`	Prior authorization / pre-cert number
`COPAY`	Copay dollar amount (includes the `$`)
`DEDUCTIBLE`	Deductible dollar amount (includes the `$`)
`BILLED_AMOUNT`	Total amount billed, owed, or charged (includes the `$`)

The model outputs BIO token labels (25 total: O + 12 × {B-, I-}), which downstream code converts back into character-offset spans and then into a JSON object.
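As a minimal sketch, the 25-label set can be built from the 12 field types in the table above. The ordering here ("O" first, then B-/I- pairs per field) is an assumption for illustration; the mapping the released checkpoint actually uses is the one stored in `model.config.id2label`.

```python
FIELD_TYPES = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]

# Assumed ordering: "O" first, then a B-/I- pair for each field type.
# The real checkpoint's ordering may differ; read model.config.id2label instead.
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELD_TYPES for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(LABELS))  # 25
print(LABELS[:5])   # ['O', 'B-CARRIER', 'I-CARRIER', 'B-PLAN_TYPE', 'I-PLAN_TYPE']
```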

Architecture

Standard RoBERTa-base encoder with a token-classification head: a linear layer over each token's contextualized representation, producing 25 logits per token. All 125M parameters are fine-tuned. Training uses fp32 (not mixed precision — see the PHI detector's card for the NaN-gradient story that motivated this), adamw_torch optimizer, max_grad_norm=1.0, an explicit classifier-head re-init (std=0.02 normal, zero bias), batch size 8, sequence length 256, learning rate 1e-5 with cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Five epochs run in ~10 minutes on a single RTX A4000.
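To make "cosine schedule and 10% warmup" concrete, here is a small pure-Python sketch of that learning-rate shape: linear warmup from 0 to the peak rate, then cosine decay back to 0. It is similar in shape to what `get_cosine_schedule_with_warmup` in transformers produces, but whether the training script used exactly this form is an assumption, and the step count below is hypothetical.

```python
import math

def lr_at(step, total_steps, base_lr=1e-5, warmup_frac=0.10):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 5000  # hypothetical number of optimizer steps, for illustration only
for s in (0, 250, 500, 2750, 5000):
    print(s, f"{lr_at(s, total):.2e}")
```

The rate peaks at 1e-5 at the end of warmup (step 500 here), is at half the peak midway through the decay, and reaches zero at the final step.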

Training data — synthetic, transparent about it

All training and evaluation data is synthetic. There is no real patient billing data in this model or its evaluation.

Training set (8,086 examples after cleanup, from 8,120 originally). Generated via the OpenAI API (gpt-4o-mini-2024-07-18, JSON-object response format, temperature 1.0) across 12 field types × healthcare practice types × channels. The generation prompt enforces a 40/40/20 realism mix (polished / casual / messy) and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).

Data cleanup. Same cue-word noise pattern observed in the PHI detector: gpt-4o-mini sometimes returns entity texts that include the cue word ("member ID AET-998-2210" instead of "AET-998-2210", "copay $35" instead of "$35"). clean_data.py in the repo strips these cue-word prefixes and re-locates the cleaned text in the source. 1,393 entities (~7.4% of all spans) were normalized during cleanup; 256 entities became empty after stripping and were dropped.
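The cleanup step described above could look roughly like the sketch below. This is not the repo's `clean_data.py`: the cue-word list, the `clean_entity` name, and the return shape are all assumptions made for illustration.

```python
import re

# Hypothetical cue-word list; the real clean_data.py may use a different one.
CUE_PREFIX = re.compile(
    r"^(?:member\s+id|group\s+(?:number|#)|claim\s+(?:id|number)|copay|deductible)"
    r"\s*[:#]?\s*",
    re.IGNORECASE,
)

def clean_entity(source, entity_text):
    """Strip a leading cue word, then re-locate the cleaned text in the source.

    Returns (text, start, end), or None when nothing is left after stripping
    or the cleaned text cannot be found in the source.
    """
    cleaned = CUE_PREFIX.sub("", entity_text).strip()
    if not cleaned:
        return None  # entity became empty after stripping: drop it
    start = source.find(cleaned)  # first occurrence only; a simplification
    if start == -1:
        return None
    return cleaned, start, start + len(cleaned)

src = "my member ID AET-998-2210 and copay $35"
print(clean_entity(src, "member ID AET-998-2210"))  # ('AET-998-2210', 13, 25)
print(clean_entity(src, "copay $35"))               # ('$35', 36, 39)
print(clean_entity(src, "copay"))                   # None
```

Using `str.find` picks the first occurrence, so a value that appears twice in a message would need a position hint to be re-located correctly.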

Test set (672 examples). Generated by Claude Haiku 4.5 with a deliberately different prompt style and abbreviation set (w/, &, ins, mbr, grp versus the train prompt's more formal style). Because the test generator differs from the training generator, a high test score cannot come from the model simply memorizing one generator's style.

Anti-leakage validation. The cross-generator split produces a large val/test gap (val macro F1 0.96; test macro F1 0.79). The val set comes from the same gpt-4o-mini distribution as training; the test set comes from a different generator. The 17-point gap is the cost of generalizing across generators — and a useful proxy for the gap that would exist between this model and a real-world distribution. Val numbers from same-generator splits routinely overstate real-world generalization on tasks like this.

Benchmark — entity-level F1, span boundaries matter

Evaluated on the 672-example held-out test set using seqeval, which requires both the entity type AND the exact span boundary to match for a true positive.
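To illustrate what strict matching means in practice, the sketch below scores predictions as (label, start, end) tuples. It is a simplified stand-in, not seqeval's API (seqeval operates on BIO tag sequences), and the offsets are made-up example values.

```python
def strict_prf(gold, pred):
    """Entity-level precision, recall, and F1 with exact (label, start, end) matching."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [("MEMBER_ID", 13, 25), ("COPAY", 36, 39)]
pred = [("MEMBER_ID", 13, 25), ("COPAY", 30, 39)]  # right type, boundary too wide
print(strict_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

The second prediction has the correct type and overlaps the gold span, yet it counts as both a false positive and a false negative because its start offset differs.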

Model	Macro F1	Weighted F1	Latency / example	Cost / 1K inferences
`raihan-js/clarioscope-insurance-v1` (CPU)	0.7882	0.8202	45.4 ms	$0.00
`gpt-4o-2024-11-20`	0.9562	0.9572	1202.3 ms	$1.90

GPT-4o wins on aggregate, but the per-entity breakdown shows the gap is concentrated in a few low-frequency fields.

Per-field breakdown

High-volume fields where fine-tune is competitive:

Field	Fine-tune	GPT-4o	Test support
`CLAIM_ID`	0.954	0.997	146
`MEMBER_ID`	0.914	0.990	431
`CARRIER`	0.905	0.964	685
`SUBSCRIBER_NAME`	0.894	0.906	196
`COPAY`	0.864	0.954	179
`BILLED_AMOUNT`	0.847	0.971	219

SUBSCRIBER_NAME is essentially tied with GPT-4o (0.89 vs 0.91). Four fields in this table (CARRIER, MEMBER_ID, CLAIM_ID, BILLED_AMOUNT) together account for 1,481 of the 3,040 test entities (~49%), and the fine-tune is within 4–13 points of GPT-4o on each.

Mid-volume fields with moderate gaps:

Field	Fine-tune	GPT-4o	Test support
`POLICY_NUMBER`	0.827	0.985	97
`GROUP_NUMBER`	0.804	0.995	216
`DEDUCTIBLE`	0.758	0.964	190

GROUP_NUMBER has a 19-point gap despite a test support of 216. The format variance for group numbers is wide (4421, 001428, GRP-882044, Group #44210) and the cross-generator split exposes the train-set format bias.

Lowest-F1 fields:

Field	Fine-tune	GPT-4o	Test support
`RELATIONSHIP`	0.703	0.802	207
`PLAN_TYPE`	0.688	0.956	359
`AUTH_NUMBER`	0.300	0.991	115

AUTH_NUMBER is the headline weakness — 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans total, and the format space is wide (PA-4421, auth #998-2210, AUTH998212, etc.). A v2 with more AUTH_NUMBER coverage in training would likely close most of this gap.

RELATIONSHIP is a hard category for both models — short string ("self", "spouse", "child"), often overlapping with other entity contexts, with a tight span boundary that's easy to miss by one token.
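The aggregate scores in the benchmark table can be cross-checked against the per-field rows. The sketch below copies the twelve rows from the three tables above and recomputes macro F1 (an unweighted mean) and weighted F1 (a support-weighted mean). Because the per-field values are rounded to three decimals, the recomputed aggregates should only be expected to agree to about three decimals.

```python
# (field, fine-tune F1, GPT-4o F1, test support), copied from the tables above
ROWS = [
    ("CLAIM_ID", 0.954, 0.997, 146),
    ("MEMBER_ID", 0.914, 0.990, 431),
    ("CARRIER", 0.905, 0.964, 685),
    ("SUBSCRIBER_NAME", 0.894, 0.906, 196),
    ("COPAY", 0.864, 0.954, 179),
    ("BILLED_AMOUNT", 0.847, 0.971, 219),
    ("POLICY_NUMBER", 0.827, 0.985, 97),
    ("GROUP_NUMBER", 0.804, 0.995, 216),
    ("DEDUCTIBLE", 0.758, 0.964, 190),
    ("RELATIONSHIP", 0.703, 0.802, 207),
    ("PLAN_TYPE", 0.688, 0.956, 359),
    ("AUTH_NUMBER", 0.300, 0.991, 115),
]

support_total = sum(support for _, _, _, support in ROWS)

# Macro F1: every field counts equally, so a weak rare field hurts a lot.
macro_ft = sum(ft for _, ft, _, _ in ROWS) / len(ROWS)
macro_gpt = sum(gpt for _, _, gpt, _ in ROWS) / len(ROWS)

# Weighted F1: each field counts in proportion to its test support.
weighted_ft = sum(ft * support for _, ft, _, support in ROWS) / support_total

print(support_total)                               # 3040
print(round(macro_ft, 4), round(weighted_ft, 4))   # 0.7882 0.8202
print(f"macro gap: {100 * (macro_gpt - macro_ft):.1f} points")
```

AUTH_NUMBER alone pulls the fine-tune's macro F1 down far more than its weighted F1, since it is one of twelve equally weighted fields but under 4% of the test entities.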

Recommended production pattern: hybrid

For a billing pipeline that processes inbound patient messages:

1. Run this model first on every message. It captures CARRIER, MEMBER_ID, CLAIM_ID, and SUBSCRIBER_NAME with near-frontier F1 at ~45 ms / message on CPU.
2. Add regex for high-value structured patterns: dollar amounts (`\$[\d,]+(?:\.\d{2})?`) and member ID format checks specific to your top carriers (Aetna and BCBS IDs, for example, follow distinct patterns).
3. Use GPT-4o as a fallback for messages where the fine-tune is uncertain, or for AUTH_NUMBER / PLAN_TYPE detection. The fallback should fire on ~10–20% of messages, not 100%.
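A rough sketch of the regex and fallback steps in the list above. The dollar-amount pattern is the one quoted in the list; everything else is an assumption for illustration: the `score` key on each span (for example, a mean softmax probability over the span's tokens), the 0.80 threshold, and the `AUTH_CUE_RE` keyword trigger are hypothetical and would need tuning against real traffic.

```python
import re

# Dollar-amount pattern quoted in the regex step above.
DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d{2})?")

# Hypothetical cue for prior-auth content, where the fine-tune is weakest.
AUTH_CUE_RE = re.compile(r"\b(?:auth|pre-?cert)", re.IGNORECASE)

def needs_fallback(text, spans, min_confidence=0.80):
    """Return True when a message should be escalated to the frontier API.

    `spans` is a list of dicts with "label" and a hypothetical "score" key.
    """
    if AUTH_CUE_RE.search(text):
        return True  # likely prior-auth content: do not trust the fine-tune alone
    return any(span["score"] < min_confidence for span in spans)

print(DOLLAR_RE.findall("copay $35 and a bill of $1,250.00"))
# ['$35', '$1,250.00']
print(needs_fallback("my copay is $35", [{"label": "COPAY", "score": 0.97}]))  # False
print(needs_fallback("status of prior auth PA-4421?", []))                      # True
```

Note that the quoted pattern also accepts a trailing comma (as in "$1,250, and ..."), so matches may need a final `rstrip(",")` before use.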

This pattern is the same architecture recommended for the PHI detector. The fine-tune does the bulk-volume linguistic work; frontier APIs handle the long tail. Together they cost an order of magnitude less than running frontier on every message.
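The cost claim can be checked with simple arithmetic. The sketch below uses the per-1K figures from the benchmark table and treats self-hosted inference as $0, as the card does, so it ignores the cost of the CPU host itself.

```python
GPT4O_COST_PER_1K = 1.90        # USD per 1K inferences, from the benchmark table
SELF_HOSTED_COST_PER_1K = 0.00  # as stated in the card; hardware cost not included

def blended_cost_per_1k(fallback_rate):
    """Cost per 1K messages when only `fallback_rate` of them reach GPT-4o."""
    return SELF_HOSTED_COST_PER_1K + fallback_rate * GPT4O_COST_PER_1K

for rate in (0.10, 0.20, 1.00):
    print(f"{rate:.0%} fallback -> ${blended_cost_per_1k(rate):.2f} per 1K")
# 10% fallback -> $0.19 per 1K
# 20% fallback -> $0.38 per 1K
# 100% fallback -> $1.90 per 1K
```

Under these assumptions the saving is 10× at a 10% fallback rate and 5× at 20%, so the order-of-magnitude figure holds at the low end of the range.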

How to use

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-insurance-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = ("Hi, I'd like to verify my coverage before tomorrow's appointment. "
        "My carrier is Aetna PPO, member ID AET-998-2210, group #4421. "
        "I'm the subscriber. My copay should be $35 — is that right?")

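# Tokenize with character offsets so predicted tokens can be mapped back to the text.
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()  # the model's forward() does not accept offset_mapping
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

# Merge each B- tag and the I- tags that follow it into one character-offset span.
id2label = model.config.id2label
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    start, end = offsets[i]
    # Special tokens (<s>, </s>) have a zero-width (0, 0) offset and are skipped.
    if label.startswith("B-") and end > start:
        ent_type = label[2:]
        j = i + 1
        while (j < len(pred_ids)
               and id2label[pred_ids[j]] == f"I-{ent_type}"
               and offsets[j][1] > offsets[j][0]):
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

# Convert to structured JSON for downstream billing system.
# Note: if a label occurs more than once in a message, the last span wins.
extracted = {span["label"]: span["text"] for span in spans}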
print(extracted)
# {'CARRIER': 'Aetna', 'PLAN_TYPE': 'PPO', 'MEMBER_ID': 'AET-998-2210',
#  'GROUP_NUMBER': '4421', 'RELATIONSHIP': 'subscriber', 'COPAY': '$35'}
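A single dict keyed by label keeps only one value per field. When a message can mention the same field twice (two copays, two claim IDs), a variant like the sketch below keeps every distinct value instead. The `spans_to_record` helper is hypothetical and not part of the released repo; it assumes span dicts with "text" and "label" keys, as in the snippet above.

```python
from collections import defaultdict

def spans_to_record(spans):
    """Group extracted spans by label, keeping every distinct value in order."""
    grouped = defaultdict(list)
    for span in spans:
        value = span["text"].strip()
        # Skip empty spans and exact repeats of a value already seen for this label.
        if value and value not in grouped[span["label"]]:
            grouped[span["label"]].append(value)
    return dict(grouped)

example = [
    {"text": "Aetna", "label": "CARRIER"},
    {"text": "$35", "label": "COPAY"},
    {"text": "$50", "label": "COPAY"},
    {"text": "Aetna", "label": "CARRIER"},
]
print(spans_to_record(example))
# {'CARRIER': ['Aetna'], 'COPAY': ['$35', '$50']}
```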

Limitations

All training and evaluation data is synthetic. No real patient billing data was used. Production deployment should include calibration against real-world inbound messages.
AUTH_NUMBER is materially weaker than frontier (0.30 vs 0.99). Do not rely on this model alone for prior-auth extraction.
Format brittleness on structured IDs. When a member ID, group number, or claim reference uses a format outside the training distribution, the model often produces a span boundary off by one token, which is a miss under strict matching.
English only, healthcare practice domain only.
RELATIONSHIP F1 is moderate (0.70). The category overlaps with other named entities and short-span detection is fragile in cross-generator evaluation.
Benchmark scope. Only GPT-4o was benchmarked in this release; Anthropic API credit was exhausted before Haiku 4.5 / Sonnet 4.6 could be run.

Intended use

A first-pass insurance / billing field extractor in a hybrid intake pipeline for healthcare practice software. Strongest on high-volume fields (CARRIER, MEMBER_ID, CLAIM_ID, SUBSCRIBER_NAME). Should be paired with regex matchers (for dollar amounts and known ID formats) and a frontier-API fallback (for AUTH_NUMBER and PLAN_TYPE edge cases).

Out-of-scope use

Sole reliance for prior-auth processing. The model's AUTH_NUMBER F1 on test data is only 0.30 (vs 0.99 for GPT-4o).
Direct downstream billing without validation. Any extracted member ID, group number, or claim reference should be validated against the carrier's verification endpoint before being relied on for payment routing.
Adversarial inputs. Not hardened against prompt injection or adversarial text crafted to manipulate extraction.

Citation

@misc{sikder2026clarioscope_insurance,
  author = {Sikder, Akteruzzaman Raihan},
  title  = {ClarioScope insurance extractor v1: a 125M-parameter RoBERTa fine-tune for structured insurance and billing extraction},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-insurance-v1}},
}

The full ClarioScope SLM Suite writeup — methodology, cost ledger, and per-model interpretation — is on dev.to. Each model has its own post; the suite-level summary is at the GitHub profile github.com/raihan-js.

Author

Built by Akteruzzaman Raihan Sikder — CTO, ClarioScope AI. Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor) — a three-model intake intelligence pipeline.
