---
library_name: transformers
license: mit
language:
- en
base_model: FacebookAI/roberta-base
pipeline_tag: token-classification
tags:
- token-classification
- ner
- insurance
- billing
- healthcare
- roberta
- clarioscope
metrics:
- f1
model-index:
- name: clarioscope-insurance-v1
  results:
  - task:
      type: token-classification
      name: Insurance / billing field extraction
    dataset:
      type: synthetic
      name: clarioscope-insurance-suite
    metrics:
    - type: macro-f1
      value: 0.7882
    - type: weighted-f1
      value: 0.8202
    - type: latency_ms_per_example
      value: 45.4
---
# ClarioScope insurance extractor v1
A 125M-parameter [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) fine-tune that extracts **structured insurance and billing fields** from inbound patient text. It tags spans across 12 field types (carrier, plan type, member ID, group number, claim ID, copay, deductible, billed amount, and more) and produces output that downstream billing systems can ingest as JSON. This is **model 3 of the [ClarioScope SLM Suite](https://huggingface.co/raihan-js)**, a three-model intake intelligence pipeline for healthcare practices.
## TL;DR
| Property | Value |
|---|---|
| Task | Token classification over 12 insurance / billing field types (BIO tags, 25 labels) |
| Base model | [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) (125M params) |
| Training data | Synthetic examples generated via `gpt-4o-mini-2024-07-18`: 8,086 train / 897 val after cleanup |
| Test data | 672 synthetic examples generated by Claude Haiku 4.5 with a deliberately different prompt style |
| Val macro F1 | 95.71% |
| Val weighted F1 | 97.06% |
| **Test macro F1 vs GPT-4o** | **78.82%** vs gpt-4o-2024-11-20 95.62% |
| **Where the fine-tune comes closest to GPT-4o** | `SUBSCRIBER_NAME` (89 vs 91), `CLAIM_ID` (95 vs 100), `MEMBER_ID` (91 vs 99), `CARRIER` (90 vs 96) |
| **Where it loses badly** | `AUTH_NUMBER` (30 vs 99), `PLAN_TYPE` (69 vs 96): fields with high format variance |
| **Test latency** | **45.4 ms/example on CPU**, 26× faster than GPT-4o |
| **Per-inference cost** | **$0** self-hosted vs $1.90 per 1K for GPT-4o |
| License | MIT |
> **Note on benchmark scope.** This release benchmarks against GPT-4o only. Anthropic API credit was exhausted during the development cycle before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could be run. A subsequent revision of this card will add the Anthropic numbers.

## Why this model exists
Patient inquiries about insurance, billing, and prior authorizations have a different shape from the rest of intake. The information value is in the structured fields (carrier name, member ID, group number, claim reference, copay amount), not in the surrounding prose. A practice's billing system can act on `{"carrier": "Aetna", "plan_type": "PPO", "member_id": "AET-998-2210"}` directly; it cannot act on "hi can you check if my Aetna PPO covers..." until those fields are extracted.
Frontier LLMs do this extraction well, with three consistent caveats: they cost real money per call, add ~1 second of latency, and send patient text to a third party. This model is the self-hosted alternative: 26× lower latency, $0 per inference after training, and patient text never leaves the host. It trails GPT-4o by 17 macro F1 points on aggregate, but it stays within 8 points of GPT-4o on four common fields (`MEMBER_ID`, `CARRIER`, `CLAIM_ID`, `SUBSCRIBER_NAME`).
This is model 3 of the ClarioScope SLM Suite:
1. **Intent classifier** ([`clarioscope-intent-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1)): what does this message want?
2. **PHI detector** ([`clarioscope-phi-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-phi-deberta-v1)): where is protected information in this message?
3. **Insurance extractor** (this model): what billing-relevant structured data is in this message?
## The 12 field types
| Label | What it captures |
|---|---|
| `CARRIER` | Insurance company name (Aetna, Blue Cross Blue Shield, UnitedHealthcare, Cigna, Medicare, Medicaid, Kaiser Permanente, Humana, Anthem) |
| `PLAN_TYPE` | Plan format or tier (PPO, HMO, EPO, HDHP, POS, Gold / Silver / Bronze / Platinum) |
| `MEMBER_ID` | The patient's subscriber / member ID on the insurance card |
| `GROUP_NUMBER` | Group / employer group number |
| `POLICY_NUMBER` | Distinct policy or contract number when separate from member ID |
| `SUBSCRIBER_NAME` | Name of the primary subscriber on the policy |
| `RELATIONSHIP` | Relationship to subscriber (self, spouse, child, dependent, parent) |
| `CLAIM_ID` | Claim reference number |
| `AUTH_NUMBER` | Prior authorization / pre-cert number |
| `COPAY` | Copay dollar amount (includes the `$`) |
| `DEDUCTIBLE` | Deductible dollar amount (includes the `$`) |
| `BILLED_AMOUNT` | Total amount billed, owed, or charged (includes the `$`) |
The model outputs BIO token labels (25 total: `O` + 12 × `{B-, I-}`), which downstream code converts back into character-offset spans and then into a JSON object.
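As a quick check of the label arithmetic, the sketch below builds a 25-label BIO set from the 12 field names in the table. The ordering of ids here is an assumption for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Field names are taken from the table above. The id ordering is an
# assumption; the real mapping lives in model.config.id2label.
FIELDS = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]


def build_bio_labels(fields):
    """Return ["O", "B-X", "I-X", ...] with one B-/I- pair per field X."""
    labels = ["O"]
    for field in fields:
        labels.append(f"B-{field}")
        labels.append(f"I-{field}")
    return labels


LABELS = build_bio_labels(FIELDS)
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))  # 25
```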
## Architecture
Standard RoBERTa-base encoder with a token-classification head: a linear layer over each token's contextualized representation, producing 25 logits per token. All 125M parameters are fine-tuned. Training uses fp32 (not mixed precision; see the PHI detector's card for the NaN-gradient story that motivated this), `adamw_torch` optimizer, `max_grad_norm=1.0`, an explicit classifier-head re-init (`std=0.02` normal, zero bias), batch size 8, sequence length 256, learning rate 1e-5 with cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Five epochs run in ~10 minutes on a single RTX A4000.
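To make the learning-rate schedule concrete, here is a minimal sketch of the usual shape of a cosine schedule with linear warmup, using the peak rate (1e-5) and warmup fraction (10%) stated above. The step count is an assumption derived from 8,086 training examples, batch size 8, and five epochs with no gradient accumulation; the actual training script may compute it differently.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_ratio=0.10):
    """Linear warmup from 0 to base_lr, then half-cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed step count: ceil(8086 / 8) steps per epoch, 5 epochs.
TOTAL_STEPS = math.ceil(8086 / 8) * 5

for step in (0, 250, 505, 2780, TOTAL_STEPS):
    print(step, f"{lr_at_step(step, TOTAL_STEPS):.2e}")
```

With early stopping the run may end before the schedule reaches zero, so the tail of the curve is not always used.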
## Training data: synthetic, and transparent about it
All training and evaluation data is **synthetic**. There is **no real patient billing data** in this model or its evaluation.
**Training set (8,086 examples after cleanup, from 8,120 originally).** Generated via the OpenAI API (`gpt-4o-mini-2024-07-18`, JSON-object response format, temperature 1.0) across 12 field types × healthcare practice types × channels. The generation prompt enforces a 40/40/20 realism mix (polished / casual / messy) and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).
**Data cleanup.** Same cue-word noise pattern observed in the PHI detector: `gpt-4o-mini` sometimes returns entity texts that include the cue word (`"member ID AET-998-2210"` instead of `"AET-998-2210"`, `"copay $35"` instead of `"$35"`). `clean_data.py` in the repo strips these cue-word prefixes and re-locates the cleaned text in the source. 1,393 entities (~7.4% of all spans) were normalized during cleanup; 256 entities became empty after stripping and were dropped.
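The cleanup step can be pictured roughly as follows. This is a hypothetical sketch, not the contents of `clean_data.py`: the cue-word list, the regex, and the function name are all assumptions chosen to match the two examples above.

```python
import re

# Hypothetical cue-word list; the real clean_data.py may use a different one.
CUE_PREFIX = re.compile(
    r"^(?:member\s+id|group\s+number|claim\s+id|copay|deductible)\s*[:#]?\s*",
    re.IGNORECASE,
)


def clean_entity(source, entity_text):
    """Strip a leading cue word, then re-locate the value in the source.

    Returns (cleaned_text, start, end), or None when nothing is left after
    stripping or the cleaned text cannot be found in the source.
    """
    cleaned = CUE_PREFIX.sub("", entity_text).strip()
    if not cleaned:
        return None
    start = source.find(cleaned)  # first occurrence only
    if start == -1:
        return None
    return cleaned, start, start + len(cleaned)


source = "my member ID AET-998-2210 and copay $35"
print(clean_entity(source, "member ID AET-998-2210"))  # ('AET-998-2210', 13, 25)
print(clean_entity(source, "copay $35"))               # ('$35', 36, 39)
print(clean_entity(source, "copay"))                   # None
```

Using `str.find` picks the first occurrence, which is a simplification: a value that appears twice in one message would need extra disambiguation.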
**Test set (672 examples).** Generated by Claude Haiku 4.5 with a deliberately different prompt style and abbreviation set (`w/`, `&`, `ins`, `mbr`, `grp` versus the train prompt's more formal style). This cross-generator split mitigates benchmark leakage from training the model on one generator's style.
**Anti-leakage validation.** The cross-generator split produces a large val/test gap (val macro F1 0.96; test macro F1 0.79). The val set comes from the same `gpt-4o-mini` distribution as training; the test set comes from a different generator. The 17-point gap is the cost of generalizing across generators, and a useful proxy for the gap that would exist between this model and a real-world distribution. Val numbers from same-generator splits routinely overstate real-world generalization on tasks like this.
## Benchmark: entity-level F1, span boundaries matter
Evaluated on the 672-example held-out test set using [seqeval](https://github.com/chakki-works/seqeval), which requires both the entity type AND the exact span boundary to match for a true positive.
| Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| **`raihan-js/clarioscope-insurance-v1` (CPU)** | **0.7882** | **0.8202** | **45.4 ms** | **$0.00** |
| `gpt-4o-2024-11-20` | 0.9562 | 0.9572 | 1202.3 ms | $1.90 |

GPT-4o wins on aggregate, but the per-entity breakdown shows the gap is concentrated in a few low-frequency fields.
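The strict matching rule behind these numbers can be illustrated with a small pure-Python sketch. It is not seqeval itself: it scores a single tag sequence, and it ignores an `I-` tag that has no preceding `B-` tag of the same type, whereas seqeval's default mode handles such tags differently. The function names are illustrative.

```python
def bio_to_entities(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) tuples."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i))
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
            else:
                start, ent_type = None, None
    return entities


def strict_f1(gold_tags, pred_tags):
    """Entity-level F1: a hit needs the same type AND the same boundaries."""
    gold = bio_to_entities(gold_tags)
    pred = bio_to_entities(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "I-MEMBER_ID", "O", "B-COPAY"]
pred = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "O", "O", "B-COPAY"]
print(strict_f1(gold, pred))  # 0.5
```

In the example the predicted `MEMBER_ID` span stops one token early, so it counts as both a false positive and a false negative even though most of the ID was found.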
## Per-field breakdown

**High-volume fields where fine-tune is competitive:**
| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| `CLAIM_ID` | 0.954 | 0.997 | 146 |
| `MEMBER_ID` | 0.914 | 0.990 | 431 |
| `CARRIER` | 0.905 | 0.964 | 685 |
| `SUBSCRIBER_NAME` | 0.894 | 0.906 | 196 |
| `COPAY` | 0.864 | 0.954 | 179 |
| `BILLED_AMOUNT` | 0.847 | 0.971 | 219 |
`SUBSCRIBER_NAME` is essentially tied with GPT-4o (0.89 vs 0.91). `CARRIER`, `MEMBER_ID`, `CLAIM_ID`, and `BILLED_AMOUNT` together account for roughly half of the test entities (1,481 of the 3,040 spans counted in the support columns on this card), and the fine-tune is within 4 to 13 points of GPT-4o on each.
**Mid-volume fields with moderate gaps:**
| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| `POLICY_NUMBER` | 0.827 | 0.985 | 97 |
| `GROUP_NUMBER` | 0.804 | 0.995 | 216 |
| `DEDUCTIBLE` | 0.758 | 0.964 | 190 |
`GROUP_NUMBER` has a 19-point gap despite a test support of 216. The format variance for group numbers is wide (`4421`, `001428`, `GRP-882044`, `Group #44210`) and the cross-generator split exposes the train-set format bias.
**Low-volume / low-F1 fields:**
| Field | Fine-tune | GPT-4o | Test support |
|---|---|---|---|
| `RELATIONSHIP` | 0.703 | 0.802 | 207 |
| `PLAN_TYPE` | 0.688 | 0.956 | 359 |
| `AUTH_NUMBER` | 0.300 | 0.991 | 115 |
`AUTH_NUMBER` is the headline weakness: 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans total, and the format space is wide (`PA-4421`, `auth #998-2210`, `AUTH998212`, etc.). A v2 with more AUTH_NUMBER coverage in training would likely close most of this gap.
`RELATIONSHIP` is a hard category for both models: a short string ("self", "spouse", "child"), often overlapping with other entity contexts, with a tight span boundary that's easy to miss by one token.
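The aggregate scores can be recomputed from the per-field tables, which also shows why the weighted figure sits above the macro figure: `AUTH_NUMBER` drags the unweighted mean down but carries little support. The sketch below copies the fine-tune F1 and test support columns from the three tables above and assumes those 12 rows are the complete set of fields.

```python
# (fine-tune F1, test support), copied from the per-field tables above.
PER_FIELD = {
    "CLAIM_ID": (0.954, 146),
    "MEMBER_ID": (0.914, 431),
    "CARRIER": (0.905, 685),
    "SUBSCRIBER_NAME": (0.894, 196),
    "COPAY": (0.864, 179),
    "BILLED_AMOUNT": (0.847, 219),
    "POLICY_NUMBER": (0.827, 97),
    "GROUP_NUMBER": (0.804, 216),
    "DEDUCTIBLE": (0.758, 190),
    "RELATIONSHIP": (0.703, 207),
    "PLAN_TYPE": (0.688, 359),
    "AUTH_NUMBER": (0.300, 115),
}

# Macro F1: every field counts equally.
macro_f1 = sum(f1 for f1, _ in PER_FIELD.values()) / len(PER_FIELD)

# Weighted F1: every field counts in proportion to its test support.
total_support = sum(n for _, n in PER_FIELD.values())
weighted_f1 = sum(f1 * n for f1, n in PER_FIELD.values()) / total_support

print(f"{macro_f1:.4f} {weighted_f1:.4f} {total_support}")  # 0.7882 0.8202 3040
```

Both values agree with the benchmark table to four decimal places.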
## Recommended production pattern: hybrid
For a billing pipeline that processes inbound patient messages:
1. **Run this model first** on every message. It captures `CARRIER`, `MEMBER_ID`, `CLAIM_ID`, and `SUBSCRIBER_NAME` with near-frontier F1 at ~45 ms / message on CPU.
2. **Add regex for high-value structured patterns**: dollar amounts (`\$[\d,]+(?:\.\d{2})?`) and member ID format checks specific to your top carriers (Aetna IDs and BCBS IDs, for example, each follow distinct patterns).
3. **Use GPT-4o as a fallback** for messages where the fine-tune is uncertain or for AUTH_NUMBER / PLAN_TYPE detection. The fallback should fire on ~10–20% of messages, not 100%.
This pattern is the same architecture recommended for the PHI detector. The fine-tune does the bulk-volume linguistic work; frontier APIs handle the long tail. Together they cost an order of magnitude less than running frontier on every message.
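The three steps above can be sketched as a small routing function plus a cost estimate. Everything beyond the dollar regex and the $1.90 per 1K figure is an assumption for illustration: the 0.80 confidence threshold is not tuned, the prior-auth trigger words are hypothetical, and the span dicts are assumed to carry a `score` field (for example, from softmax probabilities).

```python
import re

DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d{2})?")  # pattern from step 2 above
# Hypothetical trigger words suggesting a prior-auth number may be present.
AUTH_HINT_RE = re.compile(r"\b(?:auth|authorization|pre-?cert)\b", re.IGNORECASE)


def needs_fallback(text, spans, min_score=0.80):
    """Decide whether a message should be escalated to the frontier API.

    `spans` is a list of dicts with "label" and "score" keys. `min_score`
    is an illustrative threshold, not a tuned value.
    """
    if any(span["score"] < min_score for span in spans):
        return True  # the fine-tune is uncertain about at least one span
    found = {span["label"] for span in spans}
    # The text mentions prior auth but no AUTH_NUMBER span was extracted.
    return bool(AUTH_HINT_RE.search(text)) and "AUTH_NUMBER" not in found


def hybrid_cost_per_1k(fallback_rate, frontier_cost_per_1k=1.90):
    """API cost per 1K messages when only a fraction reaches the fallback."""
    return fallback_rate * frontier_cost_per_1k


print(DOLLAR_RE.findall("copay $35 and a bill for $1,250.00"))  # ['$35', '$1,250.00']
print(needs_fallback("my auth # is PA-4421", []))               # True
print(round(hybrid_cost_per_1k(0.10), 2), round(hybrid_cost_per_1k(0.20), 2))  # 0.19 0.38
```

At a 10–20% fallback rate the API bill is about $0.19 to $0.38 per 1K messages, roughly 5 to 10 times below the $1.90 for sending everything to GPT-4o. This estimate excludes hosting cost for the fine-tune.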
## How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-insurance-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = ("Hi, I'd like to verify my coverage before tomorrow's appointment. "
        "My carrier is Aetna PPO, member ID AET-998-2210, group #4421. "
        "I'm the subscriber. My copay should be $35 - is that right?")

# Keep character offsets so token predictions can be mapped back to the text.
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt",
                truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

id2label = model.config.id2label

# Merge each B- tag with the I- tags of the same type that follow it.
spans = []
i = 0
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    if label.startswith("B-"):
        ent_type = label[2:]
        start = offsets[i][0]
        end = offsets[i][1]
        j = i + 1
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

# Convert to structured JSON for downstream billing system
# (if a label occurs more than once, the last span wins).
extracted = {span["label"]: span["text"] for span in spans}
print(extracted)
# {'CARRIER': 'Aetna', 'PLAN_TYPE': 'PPO', 'MEMBER_ID': 'AET-998-2210',
#  'GROUP_NUMBER': '4421', 'RELATIONSHIP': 'subscriber', 'COPAY': '$35'}
```
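The dictionary comprehension in the example keeps only one value per label, so a message that mentions two copays or two claim IDs loses one of them. A minimal sketch of an alternative that keeps every distinct value per label is shown below; `group_spans` is a hypothetical helper, not part of the released code.

```python
from collections import defaultdict


def group_spans(spans):
    """Collect every distinct, non-empty span text under its label."""
    grouped = defaultdict(list)
    for span in spans:
        value = span["text"].strip()
        if value and value not in grouped[span["label"]]:
            grouped[span["label"]].append(value)
    return dict(grouped)


example_spans = [
    {"text": "$35", "label": "COPAY"},
    {"text": "$50", "label": "COPAY"},
    {"text": "Aetna", "label": "CARRIER"},
    {"text": "$35", "label": "COPAY"},  # repeated mention, kept once
]
print(group_spans(example_spans))
# {'COPAY': ['$35', '$50'], 'CARRIER': ['Aetna']}
```

Whether a list or a single value is the right shape depends on the downstream billing schema.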
## Limitations
- **All training and evaluation data is synthetic.** No real patient billing data was used. Production deployment should include calibration against real-world inbound messages.
- **`AUTH_NUMBER` is materially weaker than frontier** (0.30 vs 0.99). Do not rely on this model alone for prior-auth extraction.
- **Format brittleness on structured IDs.** When a member ID, group number, or claim reference uses a format outside the training distribution, the model often produces a span boundary off by one token, which is a miss under strict matching.
- **English only, healthcare practice domain only.**
- **`RELATIONSHIP` accuracy is moderate** (0.70). The category overlaps with other named entities and short-span detection is fragile in cross-generator evaluation.
- **Benchmark scope.** Only GPT-4o was benchmarked in this release; Anthropic API credit was exhausted before Haiku 4.5 / Sonnet 4.6 could be run.
## Intended use
A first-pass insurance / billing field extractor in a hybrid intake pipeline for healthcare practice software. Strongest on high-volume fields (`CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `SUBSCRIBER_NAME`). Should be paired with regex matchers (for dollar amounts and known ID formats) and a frontier-API fallback (for `AUTH_NUMBER` and `PLAN_TYPE` edge cases).
## Out-of-scope use
- **Sole reliance for prior-auth processing.** The model's `AUTH_NUMBER` F1 on test data is only 0.30.
- **Direct downstream billing without validation.** Any extracted member ID, group number, or claim reference should be validated against the carrier's verification endpoint before being relied on for payment routing.
- **Adversarial inputs.** Not hardened against prompt injection or adversarial text crafted to manipulate extraction.
## Citation
```bibtex
@misc{sikder2026clarioscope_insurance,
author = {Sikder, Akteruzzaman Raihan},
title = {ClarioScope insurance extractor v1: a 125M-parameter RoBERTa fine-tune for structured insurance and billing extraction},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-insurance-v1}},
}
```
## Read more
The full ClarioScope SLM Suite writeup (methodology, cost ledger, and per-model interpretation) is on dev.to. Each model has its own post; the suite-level summary is at the GitHub profile [github.com/raihan-js](https://github.com/raihan-js).
## Author
Built by [Akteruzzaman Raihan Sikder](https://huggingface.co/raihan-js), CTO, [ClarioScope AI](https://clarioscope.ai). Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor), a three-model intake intelligence pipeline.