ClarioScope insurance extractor v1

A 125M-parameter RoBERTa-base fine-tune that extracts structured insurance and billing fields from inbound patient text. It tags spans across 12 field types (carrier, plan type, member ID, group number, claim ID, copay, deductible, billed amount, and more) and produces output that downstream billing systems can ingest as JSON. This is model 3 of the ClarioScope SLM Suite, a three-model intake intelligence pipeline for healthcare practices.

TL;DR

| Property | Value |
|---|---|
| Task | Token classification over 12 insurance / billing field types (BIO tags, 25 labels) |
| Base model | FacebookAI/roberta-base (125M params) |
| Training data | 8,086 synthetic examples (8,086 train / 897 val after cleanup) generated via gpt-4o-mini-2024-07-18 |
| Test data | 672 synthetic examples generated by Claude Haiku 4.5 with a deliberately different prompt style |
| Val macro F1 | 95.71% |
| Val weighted F1 | 97.06% |
| Test macro F1 vs GPT-4o | 78.82% vs gpt-4o-2024-11-20 95.62% |
| Where the fine-tune is closest to GPT-4o | SUBSCRIBER_NAME (89 vs 91), CLAIM_ID (95 vs 100), MEMBER_ID (91 vs 99), CARRIER (90 vs 96) |
| Where it loses badly | AUTH_NUMBER (30 vs 99), PLAN_TYPE (69 vs 96): low-frequency fields with high format variance |
| Test latency | 45.4 ms/example on CPU, 26× faster than GPT-4o |
| Per-inference cost | $0 self-hosted vs $1.90 per 1K for GPT-4o |
| License | MIT |

Note on benchmark scope. This release benchmarks against GPT-4o only. Anthropic API credit was exhausted during the development cycle before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could be run. A subsequent revision of this card will add the Anthropic numbers.

[Figure: Macro F1 vs latency]

Why this model exists

Patient inquiries about insurance, billing, and prior authorizations have a different shape from the rest of intake. The information value is in the structured fields (carrier name, member ID, group number, claim reference, copay amount), not in the surrounding prose. A practice's billing system can act on {"carrier": "Aetna", "plan_type": "PPO", "member_id": "AET-998-2210"} directly; it cannot act on "hi can you check if my Aetna PPO covers..." until those fields are extracted.

Frontier LLMs do this extraction well, with three consistent caveats: they cost real money per call, add ~1 second of latency, and send patient text to a third party. This model is the self-hosted alternative: 26× lower latency, $0 per inference after training, and patient text never leaves the host. It trails GPT-4o by 17 macro F1 points on aggregate, but it stays within 1 to 8 points of GPT-4o on four common fields (MEMBER_ID, CARRIER, CLAIM_ID, SUBSCRIBER_NAME).

This is model 3 of the ClarioScope SLM Suite:

  1. Intent classifier (clarioscope-intent-deberta-v1): what does this message want?
  2. PHI detector (clarioscope-phi-deberta-v1): where is protected information in this message?
  3. Insurance extractor (this model): what billing-relevant structured data is in this message?

The 12 field types

| Label | What it captures |
|---|---|
| CARRIER | Insurance company name (Aetna, Blue Cross Blue Shield, UnitedHealthcare, Cigna, Medicare, Medicaid, Kaiser Permanente, Humana, Anthem) |
| PLAN_TYPE | Plan format or tier (PPO, HMO, EPO, HDHP, POS, Gold / Silver / Bronze / Platinum) |
| MEMBER_ID | The patient's subscriber / member ID on the insurance card |
| GROUP_NUMBER | Group / employer group number |
| POLICY_NUMBER | Distinct policy or contract number when separate from member ID |
| SUBSCRIBER_NAME | Name of the primary subscriber on the policy |
| RELATIONSHIP | Relationship to subscriber (self, spouse, child, dependent, parent) |
| CLAIM_ID | Claim reference number |
| AUTH_NUMBER | Prior authorization / pre-cert number |
| COPAY | Copay dollar amount (includes the $) |
| DEDUCTIBLE | Deductible dollar amount (includes the $) |
| BILLED_AMOUNT | Total amount billed, owed, or charged (includes the $) |

The model outputs BIO token labels (25 total: O + 12 × {B-, I-}), which downstream code converts back into character-offset spans and then into a JSON object.
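As a quick illustration of where the 25 comes from, the label set can be rebuilt from the 12 field types. This is a minimal sketch; the helper name and the label ordering are assumptions, and the released checkpoint's `config.id2label` remains the authoritative mapping.

```python
# The 12 field types listed in the table above.
FIELD_TYPES = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]


def build_bio_labels(field_types):
    """Return ["O", "B-X", "I-X", ...] with one B-/I- pair per field type X."""
    labels = ["O"]
    for field in field_types:
        labels.append(f"B-{field}")  # first token of a span
        labels.append(f"I-{field}")  # continuation token of the same span
    return labels


LABELS = build_bio_labels(FIELD_TYPES)
# Assumed ordering for illustration only; use model.config.id2label in practice.
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))  # 25 = 1 "O" label + 12 field types * 2 prefixes
```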

Architecture

Standard RoBERTa-base encoder with a token-classification head: a linear layer over each token's contextualized representation, producing 25 logits per token. All 125M parameters are fine-tuned. Training uses fp32 (not mixed precision; see the PHI detector's card for the NaN-gradient story that motivated this), adamw_torch optimizer, max_grad_norm=1.0, an explicit classifier-head re-init (std=0.02 normal, zero bias), batch size 8, sequence length 256, learning rate 1e-5 with cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Five epochs run in ~10 minutes on a single RTX A4000.
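The learning-rate schedule described above (peak 1e-5, 10% linear warmup, cosine decay) can be sketched in plain Python. This is an illustrative re-implementation of the common warmup-plus-cosine formula, not the training script. The step count is an assumption: 8,086 training examples, batch size 8, five full epochs, no gradient accumulation, and no early stop.

```python
import math


def cosine_lr_with_warmup(step, total_steps, peak_lr=1e-5, warmup_frac=0.10):
    """Learning rate at `step`: linear warmup to peak_lr, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup steps.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed step budget (see the lead-in): ceil(8086 / 8) steps per epoch, 5 epochs.
steps_per_epoch = math.ceil(8086 / 8)
total_steps = steps_per_epoch * 5
warmup_steps = int(total_steps * 0.10)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, cosine_lr_with_warmup(step, total_steps))
```

With a peak this low and a 10% warmup, the classifier head (re-initialized with std=0.02) sees very small updates at first, which is consistent with the stability concerns the paragraph mentions.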

Training data: synthetic, and transparent about it

All training and evaluation data is synthetic. There is no real patient billing data in this model or its evaluation.

Training set (8,086 examples after cleanup, from 8,120 originally). Generated via the OpenAI API (gpt-4o-mini-2024-07-18, JSON-object response format, temperature 1.0) across 12 field types × healthcare practice types × channels. The generation prompt enforces a 40/40/20 realism mix (polished / casual / messy) and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).

Data cleanup. Same cue-word noise pattern observed in the PHI detector: gpt-4o-mini sometimes returns entity texts that include the cue word ("member ID AET-998-2210" instead of "AET-998-2210", "copay $35" instead of "$35"). clean_data.py in the repo strips these cue-word prefixes and re-locates the cleaned text in the source. 1,393 entities (~7.4% of all spans) were normalized during cleanup; 256 entities became empty after stripping and were dropped.
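The cleanup step can be pictured roughly as follows. This is a hypothetical sketch, not the contents of clean_data.py: the cue-word list, the regular expression, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical cue-word list; the real clean_data.py may use a different one.
# A cue only counts as a prefix when followed by a separator or end of string,
# so an ID such as "AUTH998212" is left untouched.
CUE_PREFIX = re.compile(
    r"^(?:member\s+id|group\s+number|claim\s+(?:id|number)|copay|deductible|auth)"
    r"(?:\s*[:#]\s*|\s+|$)",
    re.IGNORECASE,
)


def clean_entity(source, entity_text):
    """Strip a leading cue word, then re-locate the cleaned text in `source`.

    Returns (text, start, end) as character offsets, or None when nothing is
    left after stripping or the cleaned text cannot be found in the source.
    """
    cleaned = CUE_PREFIX.sub("", entity_text).strip()
    if not cleaned:
        return None  # entity became empty after stripping, so it is dropped
    start = source.find(cleaned)  # first occurrence; a simplification
    if start == -1:
        return None
    return cleaned, start, start + len(cleaned)


message = "My member ID AET-998-2210 and copay $35."
print(clean_entity(message, "member ID AET-998-2210"))
print(clean_entity(message, "copay $35"))
print(clean_entity(message, "copay"))
```

Using the first occurrence in the source is a simplification; a message that repeats the same value would need the original offset as a hint.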

Test set (672 examples). Generated by Claude Haiku 4.5 with a deliberately different prompt style and abbreviation set (w/, &, ins, mbr, grp versus the train prompt's more formal style). This cross-generator split mitigates benchmark leakage from training the model on one generator's style.

Anti-leakage validation. The cross-generator split produces a large val/test gap (val macro F1 0.96; test macro F1 0.79). The val set comes from the same gpt-4o-mini distribution as training; the test set comes from a different generator. The 17-point gap is the cost of generalizing across generators, and a useful proxy for the gap that would exist between this model and a real-world distribution. Val numbers from same-generator splits routinely overstate real-world generalization on tasks like this.

Benchmark: entity-level F1, span boundaries matter

Evaluated on the 672-example held-out test set using seqeval, which requires both the entity type AND the exact span boundary to match for a true positive.

| Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| raihan-js/clarioscope-insurance-v1 (CPU) | 0.7882 | 0.8202 | 45.4 ms | $0.00 |
| gpt-4o-2024-11-20 | 0.9562 | 0.9572 | 1202.3 ms | $1.90 |
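The strict matching rule behind these scores (entity type and exact span boundary must both match) can be illustrated with a small stand-alone scorer. This is a simplified stand-in written for this card, not seqeval's implementation: it ignores stray I- tags that have no preceding B- tag, and it pools all entity types into a single precision, recall, and F1.

```python
def bio_to_spans(tags):
    """Return the set of (type, start, end_exclusive) spans in one BIO sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            spans.add((current, start, i))  # close the open span
            current = None
        if tag.startswith("B-"):
            current, start = tag[2:], i     # open a new span
    return spans


def entity_f1(gold_seqs, pred_seqs):
    """Precision, recall, F1 where a hit needs the same type AND the same boundaries."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The member ID is predicted one token short, so it counts as both a miss
# and a spurious span; only the copay is a true positive.
gold = [["B-MEMBER_ID", "I-MEMBER_ID", "O", "B-COPAY"]]
pred = [["B-MEMBER_ID", "O", "O", "B-COPAY"]]
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

The example shows why an off-by-one boundary is costly under this metric: it lowers precision and recall at the same time.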

[Figure: F1 and cost comparison]

GPT-4o wins on aggregate, but the per-entity breakdown shows the gap is concentrated in a few low-frequency fields.

Per-field breakdown

[Figure: Per-field F1, fine-tuned vs GPT-4o]

High-volume fields where fine-tune is competitive:

| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| CLAIM_ID | 0.954 | 0.997 | 146 |
| MEMBER_ID | 0.914 | 0.990 | 431 |
| CARRIER | 0.905 | 0.964 | 685 |
| SUBSCRIBER_NAME | 0.894 | 0.906 | 196 |
| COPAY | 0.864 | 0.954 | 179 |
| BILLED_AMOUNT | 0.847 | 0.971 | 219 |

SUBSCRIBER_NAME is essentially tied with GPT-4o (0.89 vs 0.91). CARRIER, MEMBER_ID, CLAIM_ID, and BILLED_AMOUNT together account for about 49% of the test entities (1,481 of 3,040, by the support counts in these tables), and the fine-tune is within 4 to 12 points of GPT-4o on each.

Mid-volume fields with moderate gaps:

| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| POLICY_NUMBER | 0.827 | 0.985 | 97 |
| GROUP_NUMBER | 0.804 | 0.995 | 216 |
| DEDUCTIBLE | 0.758 | 0.964 | 190 |

GROUP_NUMBER has a 19-point gap despite a test support of 216 entities. The format variance for group numbers is wide (4421, 001428, GRP-882044, Group #44210) and the cross-generator split exposes the train-set format bias.

Low-volume / low-F1 fields:

| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| RELATIONSHIP | 0.703 | 0.802 | 207 |
| PLAN_TYPE | 0.688 | 0.956 | 359 |
| AUTH_NUMBER | 0.300 | 0.991 | 115 |

AUTH_NUMBER is the headline weakness: 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans total, and the format space is wide (PA-4421, auth #998-2210, AUTH998212, etc.). A v2 with more AUTH_NUMBER coverage in training would likely close most of this gap.

RELATIONSHIP is a hard category for both models: short strings ("self", "spouse", "child"), often overlapping with other entity contexts, with a tight span boundary that's easy to miss by one token.
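As a consistency check, the aggregate scores in the benchmark table can be recomputed from the per-field rows: macro F1 is the plain mean over the 12 fields, and weighted F1 is the support-weighted mean. The sketch below uses only the rounded values printed in the tables above, so it is expected to agree with the reported aggregates only to within rounding (about 0.001); the function names are illustrative.

```python
# (field, fine-tune F1, GPT-4o F1, test support), copied from the three tables above.
ROWS = [
    ("CLAIM_ID",        0.954, 0.997, 146),
    ("MEMBER_ID",       0.914, 0.990, 431),
    ("CARRIER",         0.905, 0.964, 685),
    ("SUBSCRIBER_NAME", 0.894, 0.906, 196),
    ("COPAY",           0.864, 0.954, 179),
    ("BILLED_AMOUNT",   0.847, 0.971, 219),
    ("POLICY_NUMBER",   0.827, 0.985, 97),
    ("GROUP_NUMBER",    0.804, 0.995, 216),
    ("DEDUCTIBLE",      0.758, 0.964, 190),
    ("RELATIONSHIP",    0.703, 0.802, 207),
    ("PLAN_TYPE",       0.688, 0.956, 359),
    ("AUTH_NUMBER",     0.300, 0.991, 115),
]


def macro_f1(rows, col):
    """Unweighted mean of per-field F1 (every field counts equally)."""
    return sum(row[col] for row in rows) / len(rows)


def weighted_f1(rows, col):
    """Mean of per-field F1 weighted by each field's test support."""
    total_support = sum(row[3] for row in rows)
    return sum(row[col] * row[3] for row in rows) / total_support


for name, col in (("fine-tune", 1), ("GPT-4o", 2)):
    print(f"{name}: macro={macro_f1(ROWS, col):.4f} weighted={weighted_f1(ROWS, col):.4f}")
```

Because macro F1 weights every field equally, the single 0.300 on AUTH_NUMBER pulls the fine-tune's macro score down more than its weighted score, which is why the two aggregates differ by about three points.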

Recommended production pattern: hybrid

For a billing pipeline that processes inbound patient messages:

  1. Run this model first on every message. Captures CARRIER, MEMBER_ID, CLAIM_ID, SUBSCRIBER_NAME with near-frontier F1 at ~45 ms / message on CPU.
  2. Add regex for high-value structured patterns: dollar amounts (\$[\d,]+(?:\.\d{2})?) and member ID format checks specific to your top carriers (Aetna and BCBS IDs, for example, follow distinct patterns).
  3. Use GPT-4o as a fallback for messages where the fine-tune is uncertain or for AUTH_NUMBER / PLAN_TYPE detection. The fallback should fire on ~10 to 20% of messages, not 100%.
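The three steps above might be wired together along these lines. This is a minimal sketch under stated assumptions: the `confidence` field on each span, the 0.80 threshold, and the prior-auth cue pattern are all hypothetical (the usage code in this card does not compute confidences; one option would be the softmax probability of the predicted label). Only the dollar-amount regex comes from the list above.

```python
import re

# Dollar-amount pattern from step 2.
DOLLAR = re.compile(r"\$[\d,]+(?:\.\d{2})?")

# Hypothetical cue pattern: messages that talk about prior auth go to the fallback,
# since AUTH_NUMBER is the fine-tune's weakest field.
AUTH_CUE = re.compile(r"\b(?:prior\s+auth|pre-?cert|auth(?:orization)?)\b", re.IGNORECASE)


def route(text, spans, min_confidence=0.80):
    """Decide whether a message needs the frontier-API fallback.

    `spans` is a list of {"label", "text", "confidence"} dicts produced by the
    fine-tune; "confidence" is an assumed field, not part of the card's usage code.
    """
    regex_amounts = DOLLAR.findall(text)  # step 2: cheap structured check
    low_confidence = any(s["confidence"] < min_confidence for s in spans)
    auth_mentioned = bool(AUTH_CUE.search(text))
    return {
        "regex_amounts": regex_amounts,
        "use_fallback": low_confidence or auth_mentioned,  # step 3
    }


print(route("My copay is $35 and I was billed $1,250.00.",
            [{"label": "COPAY", "text": "$35", "confidence": 0.97}]))
print(route("Need a prior auth, number PA-4421.", []))
```

The threshold would need tuning on real traffic so that the fallback fires at roughly the 10 to 20% rate the list suggests.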

This pattern is the same architecture recommended for the PHI detector. The fine-tune does the bulk-volume linguistic work; frontier APIs handle the long tail. Together they cost an order of magnitude less than running frontier on every message.
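The cost claim can be checked with the benchmark figures. The sketch below assumes the fine-tune runs on every message and GPT-4o runs only on the fallback share; it counts API spend only (self-hosting compute is treated as $0, as in the benchmark table), and the function names are illustrative.

```python
GPT4O_COST_PER_1K = 1.90   # USD per 1K inferences, from the benchmark table
FINE_TUNE_COST_PER_1K = 0.0
FINE_TUNE_LATENCY_MS = 45.4
GPT4O_LATENCY_MS = 1202.3


def hybrid_cost_per_1k(fallback_rate):
    """API cost per 1K messages when only `fallback_rate` of them reach GPT-4o."""
    return FINE_TUNE_COST_PER_1K + fallback_rate * GPT4O_COST_PER_1K


def hybrid_mean_latency_ms(fallback_rate):
    """Mean latency if fallback messages pay both the fine-tune and the GPT-4o call."""
    return FINE_TUNE_LATENCY_MS + fallback_rate * GPT4O_LATENCY_MS


for rate in (0.10, 0.20):
    cost = hybrid_cost_per_1k(rate)
    print(f"fallback {rate:.0%}: ${cost:.2f} per 1K "
          f"({GPT4O_COST_PER_1K / cost:.0f}x cheaper), "
          f"mean latency {hybrid_mean_latency_ms(rate):.1f} ms")
```

Under these assumptions the saving is 10× at a 10% fallback rate and 5× at 20%, so the order-of-magnitude figure holds at the low end of the suggested range.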

How to use

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-insurance-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = ("Hi, I'd like to verify my coverage before tomorrow's appointment. "
        "My carrier is Aetna PPO, member ID AET-998-2210, group #4421. "
        "I'm the subscriber. My copay should be $35 - is that right?")

enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

id2label = model.config.id2label
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    if label.startswith("B-"):
        ent_type = label[2:]
        start = offsets[i][0]
        end = offsets[i][1]
        j = i + 1
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

# Convert to structured JSON for the downstream billing system.
# Note: a plain dict keeps only the last span per label if a field appears more than once.
extracted = {span["label"]: span["text"] for span in spans}
print(extracted)
# {'CARRIER': 'Aetna', 'PLAN_TYPE': 'PPO', 'MEMBER_ID': 'AET-998-2210',
#  'GROUP_NUMBER': '4421', 'RELATIONSHIP': 'subscriber', 'COPAY': '$35'}
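A message can mention the same field more than once (two claim IDs, for example), in which case a label-to-text dict keeps only one of them. One possible variant, sketched here with a hypothetical helper name, collects every distinct value per label and lower-cases the keys to match the JSON shape shown earlier in this card:

```python
from collections import defaultdict


def spans_to_record(spans):
    """Group extracted spans by label, keeping every distinct value in order."""
    grouped = defaultdict(list)
    for span in spans:
        value = span["text"].strip()
        # Skip empty strings and exact repeats of a value already seen for this label.
        if value and value not in grouped[span["label"]]:
            grouped[span["label"]].append(value)
    return {label.lower(): values for label, values in grouped.items()}


example_spans = [
    {"text": "Aetna", "label": "CARRIER"},
    {"text": "CLM-1001", "label": "CLAIM_ID"},
    {"text": "CLM-1002", "label": "CLAIM_ID"},
    {"text": "Aetna", "label": "CARRIER"},
]
print(spans_to_record(example_spans))
# {'carrier': ['Aetna'], 'claim_id': ['CLM-1001', 'CLM-1002']}
```

Whether the billing system wants lists or a single value per field is a downstream decision; the list form simply avoids silently dropping data.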

Limitations

  • All training and evaluation data is synthetic. No real patient billing data was used. Production deployment should include calibration against real-world inbound messages.
  • AUTH_NUMBER is materially weaker than frontier (0.30 vs 0.99). Do not rely on this model alone for prior-auth extraction.
  • Format brittleness on structured IDs. When a member ID, group number, or claim reference uses a format outside the training distribution, the model often produces a span boundary off by one token, which is a miss under strict matching.
  • English only, healthcare practice domain only.
  • RELATIONSHIP accuracy is moderate (0.70). The category overlaps with other named entities and short-span detection is fragile in cross-generator evaluation.
  • Benchmark scope. Only GPT-4o was benchmarked in this release; Anthropic API credit was exhausted before Haiku 4.5 / Sonnet 4.6 could be run.

Intended use

A first-pass insurance / billing field extractor in a hybrid intake pipeline for healthcare practice software. Strongest on high-volume fields (CARRIER, MEMBER_ID, CLAIM_ID, SUBSCRIBER_NAME). Should be paired with regex matchers (for dollar amounts and known ID formats) and a frontier-API fallback (for AUTH_NUMBER and PLAN_TYPE edge cases).

Out-of-scope use

  • Sole reliance for prior-auth processing. The model scores only 0.30 F1 on AUTH_NUMBER on test data.
  • Direct downstream billing without validation. Any extracted member ID, group number, or claim reference should be validated against the carrier's verification endpoint before being relied on for payment routing.
  • Adversarial inputs. Not hardened against prompt injection or adversarial text crafted to manipulate extraction.

Citation

@misc{sikder2026clarioscope_insurance,
  author = {Sikder, Akteruzzaman Raihan},
  title  = {ClarioScope insurance extractor v1: a 125M-parameter RoBERTa fine-tune for structured insurance and billing extraction},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-insurance-v1}},
}

Read more

The full ClarioScope SLM Suite writeup (methodology, cost ledger, and per-model interpretation) is on dev.to. Each model has its own post; the suite-level summary is at the GitHub profile github.com/raihan-js.

Author

Built by Akteruzzaman Raihan Sikder, CTO, ClarioScope AI. Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor), a three-model intake intelligence pipeline.
