Add README.md
README.md
CHANGED
|
@@ -1,65 +1,252 @@
|
|
| 1 |
---
|
| 2 |
library_name: transformers
|
| 3 |
license: mit
|
| 4 |
base_model: FacebookAI/roberta-base
|
| 5 |
tags:
|
| 6 |
-
-
|
| 7 |
model-index:
|
| 8 |
- name: clarioscope-insurance-v1
|
| 9 |
-
results:
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
| 13 |
-
should probably proofread and complete it, then remove this comment. -->
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
| 18 |
-
It achieves the following results on the evaluation set:
|
| 19 |
-
- Loss: 0.0312
|
| 20 |
-
- Macro F1: 0.9571
|
| 21 |
-
- Weighted F1: 0.9706
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
The
|
| 40 |
-
- learning_rate: 1e-05
|
| 41 |
-
- train_batch_size: 8
|
| 42 |
-
- eval_batch_size: 16
|
| 43 |
-
- seed: 42
|
| 44 |
-
- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
|
| 45 |
-
- lr_scheduler_type: cosine
|
| 46 |
-
- lr_scheduler_warmup_steps: 0.1
|
| 47 |
-
- num_epochs: 5
|
| 48 |
|
| 49 |
-
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|:-------------:|:-----:|:----:|:---------------:|:--------:|:-----------:|
|
| 53 |
-
| 0.0304 | 1.0 | 1011 | 0.0343 | 0.9486 | 0.9624 |
|
| 54 |
-
| 0.0147 | 2.0 | 2022 | 0.0345 | 0.9537 | 0.9682 |
|
| 55 |
-
| 0.0223 | 3.0 | 3033 | 0.0312 | 0.9571 | 0.9706 |
|
| 56 |
-
| 0.0082 | 4.0 | 4044 | 0.0325 | 0.9562 | 0.9704 |
|
| 57 |
-
| 0.0152 | 5.0 | 5055 | 0.0328 | 0.9563 | 0.9706 |
|
| 58 |
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 1 |
---
|
| 2 |
library_name: transformers
|
| 3 |
license: mit
|
| 4 |
+
language:
|
| 5 |
+
- en
|
| 6 |
base_model: FacebookAI/roberta-base
|
| 7 |
+
pipeline_tag: token-classification
|
| 8 |
tags:
|
| 9 |
+
- token-classification
|
| 10 |
+
- ner
|
| 11 |
+
- insurance
|
| 12 |
+
- billing
|
| 13 |
+
- healthcare
|
| 14 |
+
- roberta
|
| 15 |
+
- clarioscope
|
| 16 |
+
metrics:
|
| 17 |
+
- f1
|
| 18 |
model-index:
|
| 19 |
- name: clarioscope-insurance-v1
|
| 20 |
+
results:
|
| 21 |
+
- task:
|
| 22 |
+
type: token-classification
|
| 23 |
+
name: Insurance / billing field extraction
|
| 24 |
+
dataset:
|
| 25 |
+
type: synthetic
|
| 26 |
+
name: clarioscope-insurance-suite
|
| 27 |
+
metrics:
|
| 28 |
+
- type: macro-f1
|
| 29 |
+
value: 0.7882
|
| 30 |
+
- type: weighted-f1
|
| 31 |
+
value: 0.8202
|
| 32 |
+
- type: latency_ms_per_example
|
| 33 |
+
value: 45.4
|
| 34 |
---
|
| 35 |
|
| 36 |
+
# ClarioScope insurance extractor v1
|
| 37 |
|
| 38 |
+
A 125M-parameter [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) fine-tune that extracts **structured insurance and billing fields** from inbound patient text. It tags spans across 12 field types (carrier, plan type, member ID, group number, claim ID, copay, deductible, billed amount, and more) and produces output that downstream billing systems can ingest as JSON. This is **model 3 of the [ClarioScope SLM Suite](https://huggingface.co/raihan-js)**, a three-model intake intelligence pipeline for healthcare practices.
|
| 39 |
|
| 40 |
+
## TL;DR
|
| 41 |
|
| 42 |
+
| Property | Value |
|
| 43 |
+
|---|---|
|
| 44 |
+
| Task | Token classification over 12 insurance / billing field types (BIO tags, 25 labels) |
|
| 45 |
+
| Base model | [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) (125M params) |
|
| 46 |
+
| Training data | Synthetic examples generated via `gpt-4o-mini-2024-07-18` (8,086 train / 897 val after cleanup) |
|
| 47 |
+
| Test data | 672 synthetic examples generated by Claude Haiku 4.5 with a deliberately different prompt style |
|
| 48 |
+
| Val macro F1 | 95.71% |
|
| 49 |
+
| Val weighted F1 | 97.06% |
|
| 50 |
+
| **Test macro F1 vs GPT-4o** | **78.82%** vs gpt-4o-2024-11-20 95.62% |
|
| 51 |
+
| **Where fine-tune is closest to GPT-4o** | `SUBSCRIBER_NAME` (89 vs 91), `CLAIM_ID` (95 vs 100), `MEMBER_ID` (91 vs 99), `CARRIER` (90 vs 96) |
|
| 52 |
+
| **Where it loses badly** | `AUTH_NUMBER` (30 vs 99), `PLAN_TYPE` (69 vs 96); low-frequency fields with high format variance |
|
| 53 |
+
| **Test latency** | **45.4 ms/example on CPU**, about 26× faster than GPT-4o |
|
| 54 |
+
| **Per-inference cost** | **$0** self-hosted vs $1.90 per 1K for GPT-4o |
|
| 55 |
+
| License | MIT |
|
| 56 |
|
| 57 |
+
> **Note on benchmark scope.** This release benchmarks against GPT-4o only. Anthropic API credit was exhausted during the development cycle before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could be run. A subsequent revision of this card will add the Anthropic numbers.
|
| 58 |
|
| 59 |
+

|
| 60 |
|
| 61 |
+
## Why this model exists
|
| 62 |
|
| 63 |
+
Patient inquiries about insurance, billing, and prior authorizations have a different shape from the rest of intake. The information value is in the structured fields (carrier name, member ID, group number, claim reference, copay amount), not in the surrounding prose. A practice's billing system can act on `{"carrier": "Aetna", "plan_type": "PPO", "member_id": "AET-998-2210"}` directly; it cannot act on "hi can you check if my Aetna PPO covers..." until those fields are extracted.
|
| 64 |
|
| 65 |
+
Frontier LLMs do this extraction well, with three consistent caveats: they cost real money per call, add ~1 second of latency, and send patient text to a third party. This model is the self-hosted alternative: 26× lower latency, $0 per inference after training, and patient text never leaves the host. It trails GPT-4o by 17 macro F1 points on aggregate, but it stays within about 1 to 8 F1 points of GPT-4o on four core fields (`MEMBER_ID`, `CARRIER`, `CLAIM_ID`, `SUBSCRIBER_NAME`).
|
| 66 |
|
| 67 |
+
This is model 3 of the ClarioScope SLM Suite:
|
| 68 |
|
| 69 |
+
1. **Intent classifier** ([`clarioscope-intent-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1)): what does this message want?
|
| 70 |
+
2. **PHI detector** ([`clarioscope-phi-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-phi-deberta-v1)): where is protected information in this message?
|
| 71 |
+
3. **Insurance extractor** (this model): what billing-relevant structured data is in this message?
|
| 72 |
|
| 73 |
+
## The 12 field types
|
| 74 |
|
| 75 |
+
| Label | What it captures |
|
| 76 |
+
|---|---|
|
| 77 |
+
| `CARRIER` | Insurance company name (Aetna, Blue Cross Blue Shield, UnitedHealthcare, Cigna, Medicare, Medicaid, Kaiser Permanente, Humana, Anthem) |
|
| 78 |
+
| `PLAN_TYPE` | Plan format or tier (PPO, HMO, EPO, HDHP, POS, Gold / Silver / Bronze / Platinum) |
|
| 79 |
+
| `MEMBER_ID` | The patient's subscriber / member ID on the insurance card |
|
| 80 |
+
| `GROUP_NUMBER` | Group / employer group number |
|
| 81 |
+
| `POLICY_NUMBER` | Distinct policy or contract number when separate from member ID |
|
| 82 |
+
| `SUBSCRIBER_NAME` | Name of the primary subscriber on the policy |
|
| 83 |
+
| `RELATIONSHIP` | Relationship to subscriber (self, spouse, child, dependent, parent) |
|
| 84 |
+
| `CLAIM_ID` | Claim reference number |
|
| 85 |
+
| `AUTH_NUMBER` | Prior authorization / pre-cert number |
|
| 86 |
+
| `COPAY` | Copay dollar amount (includes the `$`) |
|
| 87 |
+
| `DEDUCTIBLE` | Deductible dollar amount (includes the `$`) |
|
| 88 |
+
| `BILLED_AMOUNT` | Total amount billed, owed, or charged (includes the `$`) |
|
| 89 |
|
| 90 |
+
The model outputs BIO token labels (25 total: `O` + 12 × `{B-, I-}`), which downstream code converts back into character-offset spans and then into a JSON object.
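
As a minimal sketch of how the 25-label count follows from the 12 field types: the ordering below is an assumption for illustration only, and the authoritative mapping is whatever `model.config.id2label` contains.

```python
FIELD_TYPES = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]

# "O" (outside any field) plus a B- (begin) and I- (inside) tag per field type.
# The index order here is hypothetical; read the real one from the model config.
labels = ["O"] + [f"{prefix}-{field}" for field in FIELD_TYPES for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))
```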
|
| 91 |
|
| 92 |
+
## Architecture
|
| 93 |
|
| 94 |
+
Standard RoBERTa-base encoder with a token-classification head: a linear layer over each token's contextualized representation, producing 25 logits per token. All 125M parameters are fine-tuned. Training uses fp32 (not mixed precision; see the PHI detector's card for the NaN-gradient story that motivated this), `adamw_torch` optimizer, `max_grad_norm=1.0`, an explicit classifier-head re-init (`std=0.02` normal, zero bias), batch size 8, sequence length 256, learning rate 1e-5 with cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Five epochs run in ~10 minutes on a single RTX A4000.
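
To make the schedule concrete, here is a rough sketch of the step arithmetic and a linear-warmup, cosine-decay learning rate. It assumes a single device with no gradient accumulation and decay to zero at the final step; the function name `lr_at` is illustrative and not taken from the training code.

```python
import math

TRAIN_EXAMPLES = 8086    # training set size stated in this card
BATCH_SIZE = 8
EPOCHS = 5
BASE_LR = 1e-5
WARMUP_FRACTION = 0.10   # "10% warmup"

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_FRACTION * total_steps)

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has 1011 optimizer steps per epoch and 5055 in total, with the peak learning rate reached after 506 steps.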
|
| 95 |
|
| 96 |
+
## Training data: synthetic, transparent about it
|
| 97 |
+
|
| 98 |
+
All training and evaluation data is **synthetic**. There is **no real patient billing data** in this model or its evaluation.
|
| 99 |
+
|
| 100 |
+
**Training set (8,086 examples after cleanup, from 8,120 originally).** Generated via the OpenAI API (`gpt-4o-mini-2024-07-18`, JSON-object response format, temperature 1.0) across 12 field types × healthcare practice types × channels. The generation prompt enforces a 40/40/20 realism mix (polished / casual / messy) and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).
|
| 101 |
+
|
| 102 |
+
**Data cleanup.** Same cue-word noise pattern observed in the PHI detector: `gpt-4o-mini` sometimes returns entity texts that include the cue word (`"member ID AET-998-2210"` instead of `"AET-998-2210"`, `"copay $35"` instead of `"$35"`). `clean_data.py` in the repo strips these cue-word prefixes and re-locates the cleaned text in the source. 1,393 entities (~7.4% of all spans) were normalized during cleanup; 256 entities became empty after stripping and were dropped.
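
The cleanup step described above could look roughly like the sketch below. This is not the repo's `clean_data.py`: the cue-word list, the regex, and the function name `clean_entity` are all assumptions made for illustration.

```python
import re

# Hypothetical cue-word list; the actual clean_data.py may use a different one.
CUE_PREFIX = re.compile(
    r"^(?:member\s+id|group\s+number|claim\s+id|copay|deductible)\b[\s:#]*",
    re.IGNORECASE,
)

def clean_entity(source: str, entity_text: str):
    """Strip a leading cue word, then re-locate the cleaned text in the source.

    Returns (text, start, end), or None when nothing is left after stripping
    or the cleaned text cannot be found in the source.
    """
    cleaned = CUE_PREFIX.sub("", entity_text).strip()
    if not cleaned:
        return None  # entity became empty after stripping, so it is dropped
    start = source.find(cleaned)  # simplification: takes the first occurrence
    if start == -1:
        return None
    return cleaned, start, start + len(cleaned)

source = "My member ID AET-998-2210 and copay $35"
print(clean_entity(source, "member ID AET-998-2210"))
print(clean_entity(source, "copay $35"))
```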
|
| 103 |
+
|
| 104 |
+
**Test set (672 examples).** Generated by Claude Haiku 4.5 with a deliberately different prompt style and abbreviation set (`w/`, `&`, `ins`, `mbr`, `grp` versus the train prompt's more formal style). This cross-generator split mitigates benchmark leakage from training the model on one generator's style.
|
| 105 |
+
|
| 106 |
+
**Anti-leakage validation.** The cross-generator split produces a large val/test gap (val macro F1 0.96; test macro F1 0.79). The val set comes from the same `gpt-4o-mini` distribution as training; the test set comes from a different generator. The 17-point gap is the cost of generalizing across generators, and a useful proxy for the gap that would exist between this model and a real-world distribution. Val numbers from same-generator splits routinely overstate real-world generalization on tasks like this.
|
| 107 |
+
|
| 108 |
+
## Benchmark: entity-level F1, span boundaries matter
|
| 109 |
+
|
| 110 |
+
Evaluated on the 672-example held-out test set using [seqeval](https://github.com/chakki-works/seqeval), which requires both the entity type AND the exact span boundary to match for a true positive.
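
To illustrate what exact-boundary matching means for the score, here is a simplified pure-Python sketch. It is not seqeval itself: it ignores `I-` tags that do not follow a matching `B-` or `I-` tag, and it computes a single micro-averaged F1 rather than per-type scores.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) for a BIO tag sequence; end is exclusive."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last entity
        inside = tag.startswith("I-") and tag[2:] == ent_type
        if not inside:
            if ent_type is not None:
                entities.add((ent_type, start, i))
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
            else:
                start, ent_type = None, None
    return entities

def entity_f1(gold_tags, pred_tags):
    """F1 where a prediction counts only if type AND both boundaries match."""
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "I-MEMBER_ID", "O", "B-COPAY"]
pred = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "O", "O", "B-COPAY"]
print(entity_f1(gold, pred))
```

In this toy case the predicted member ID stops one token early, so it counts as both a false positive and a false negative even though most of the span was found.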
|
| 111 |
+
|
| 112 |
+
| Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
|
| 113 |
+
|---|---|---|---|---|
|
| 114 |
+
| **`raihan-js/clarioscope-insurance-v1` (CPU)** | **0.7882** | **0.8202** | **45.4 ms** | **$0.00** |
|
| 115 |
+
| `gpt-4o-2024-11-20` | 0.9562 | 0.9572 | 1202.3 ms | $1.90 |
|
| 116 |
+
|
| 117 |
+

|
| 118 |
+
|
| 119 |
+
GPT-4o wins on aggregate, but the per-entity breakdown shows the gap is concentrated in a few low-frequency fields.
|
| 120 |
+
|
| 121 |
+
## Per-field breakdown
|
| 122 |
+
|
| 123 |
+

|
| 124 |
+
|
| 125 |
+
**High-volume fields where fine-tune is competitive:**
|
| 126 |
+
|
| 127 |
+
| Field | Fine-tune | GPT-4o | Test support |
|
| 128 |
+
|---|---|---|---|
|
| 129 |
+
| `CLAIM_ID` | 0.954 | 0.997 | 146 |
|
| 130 |
+
| `MEMBER_ID` | 0.914 | 0.990 | 431 |
|
| 131 |
+
| `CARRIER` | 0.905 | 0.964 | 685 |
|
| 132 |
+
| `SUBSCRIBER_NAME` | 0.894 | 0.906 | 196 |
|
| 133 |
+
| `COPAY` | 0.864 | 0.954 | 179 |
|
| 134 |
+
| `BILLED_AMOUNT` | 0.847 | 0.971 | 219 |
|
| 135 |
+
|
| 136 |
+
`SUBSCRIBER_NAME` is essentially tied with GPT-4o (0.89 vs 0.91). Four of the highest-support fields in this table (`CARRIER`, `MEMBER_ID`, `BILLED_AMOUNT`, `CLAIM_ID`) collectively cover about 49% of the test entities (1,481 of 3,040, summing the Test support columns across all three tables), and the fine-tune is within 4 to 13 points of GPT-4o on each.
|
| 137 |
+
|
| 138 |
+
**Mid-volume fields with moderate gaps:**
|
| 139 |
+
|
| 140 |
+
| Field | Fine-tune | GPT-4o | Test support |
|
| 141 |
+
|---|---|---|---|
|
| 142 |
+
| `POLICY_NUMBER` | 0.827 | 0.985 | 97 |
|
| 143 |
+
| `GROUP_NUMBER` | 0.804 | 0.995 | 216 |
|
| 144 |
+
| `DEDUCTIBLE` | 0.758 | 0.964 | 190 |
|
| 145 |
+
|
| 146 |
+
`GROUP_NUMBER` has a 19-point gap despite a test support of 216. The format variance for group numbers is wide (`4421`, `001428`, `GRP-882044`, `Group #44210`) and the cross-generator split exposes the train-set format bias.
|
| 147 |
+
|
| 148 |
+
**Fields with the lowest fine-tune F1:**
|
| 149 |
+
|
| 150 |
+
| Field | Fine-tune | GPT-4o | Test support |
|
| 151 |
+
|---|---|---|---|
|
| 152 |
+
| `RELATIONSHIP` | 0.703 | 0.802 | 207 |
|
| 153 |
+
| `PLAN_TYPE` | 0.688 | 0.956 | 359 |
|
| 154 |
+
| `AUTH_NUMBER` | 0.300 | 0.991 | 115 |
|
| 155 |
+
|
| 156 |
+
`AUTH_NUMBER` is the headline weakness: 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans total, and the format space is wide (`PA-4421`, `auth #998-2210`, `AUTH998212`, etc.). A v2 with more AUTH_NUMBER coverage in training would likely close most of this gap.
|
| 157 |
+
|
| 158 |
+
`RELATIONSHIP` is a hard category for both models: a short string ("self", "spouse", "child"), often overlapping with other entity contexts, with a tight span boundary that's easy to miss by one token.
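
As a cross-check, the aggregate scores in the benchmark table can be recomputed from the per-field numbers above. The sketch below assumes macro F1 is the unweighted mean over the 12 fields and weighted F1 is the support-weighted mean; small differences are expected because the per-field values are rounded to three decimals.

```python
# (fine-tune F1, GPT-4o F1, test support), copied from the per-field tables
PER_FIELD = {
    "CLAIM_ID": (0.954, 0.997, 146),
    "MEMBER_ID": (0.914, 0.990, 431),
    "CARRIER": (0.905, 0.964, 685),
    "SUBSCRIBER_NAME": (0.894, 0.906, 196),
    "COPAY": (0.864, 0.954, 179),
    "BILLED_AMOUNT": (0.847, 0.971, 219),
    "POLICY_NUMBER": (0.827, 0.985, 97),
    "GROUP_NUMBER": (0.804, 0.995, 216),
    "DEDUCTIBLE": (0.758, 0.964, 190),
    "RELATIONSHIP": (0.703, 0.802, 207),
    "PLAN_TYPE": (0.688, 0.956, 359),
    "AUTH_NUMBER": (0.300, 0.991, 115),
}

total_support = sum(support for _, _, support in PER_FIELD.values())

# Macro: every field counts equally. Weighted: fields count by test support.
macro_ft = sum(ft for ft, _, _ in PER_FIELD.values()) / len(PER_FIELD)
weighted_ft = sum(ft * s for ft, _, s in PER_FIELD.values()) / total_support
macro_gpt = sum(gpt for _, gpt, _ in PER_FIELD.values()) / len(PER_FIELD)

print(f"support={total_support} macro_ft={macro_ft:.4f} "
      f"weighted_ft={weighted_ft:.4f} macro_gpt={macro_gpt:.4f}")
```

The low `AUTH_NUMBER` score pulls macro F1 down more than weighted F1 because it has only 115 of the test entities but a full one-twelfth share of the macro average.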
|
| 159 |
+
|
| 160 |
+
## Recommended production pattern: hybrid
|
| 161 |
+
|
| 162 |
+
For a billing pipeline that processes inbound patient messages:
|
| 163 |
+
|
| 164 |
+
1. **Run this model first** on every message. Captures `CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `SUBSCRIBER_NAME` with near-frontier F1 at ~45 ms / message on CPU.
|
| 165 |
+
2. **Add regex for high-value structured patterns**: dollar amounts (`\$[\d,]+(?:\.\d{2})?`) and member ID format checks specific to your top carriers (Aetna and BCBS IDs, for example, follow distinct patterns).
|
| 166 |
+
3. **Use GPT-4o as a fallback** for messages where the fine-tune is uncertain or for AUTH_NUMBER / PLAN_TYPE detection. The fallback should fire on ~10 to 20% of messages, not 100%.
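
The three steps above can be sketched as a small routing function. Everything here except the dollar-amount pattern is hypothetical: the per-span `score` field, the 0.80 threshold, and the cue words used to guess that a message concerns `AUTH_NUMBER` or `PLAN_TYPE` are assumptions, not part of the released model.

```python
import re

DOLLAR_AMOUNT = re.compile(r"\$[\d,]+(?:\.\d{2})?")  # pattern quoted in step 2
# Hypothetical cue words suggesting the weaker AUTH_NUMBER / PLAN_TYPE fields.
WEAK_FIELD_CUES = re.compile(r"\b(?:auth|authorization|pre-?cert|plan)\b", re.IGNORECASE)

def route(text, spans, min_confidence=0.80):
    """Decide whether a message should also go to the frontier fallback.

    `spans` is a list of dicts with "label", "text", and "score" keys; the
    "score" key and the 0.80 threshold are assumptions for illustration.
    """
    uncertain = any(span["score"] < min_confidence for span in spans)
    weak_field_likely = bool(WEAK_FIELD_CUES.search(text))
    return {
        "dollar_amounts": DOLLAR_AMOUNT.findall(text),
        "use_fallback": uncertain or weak_field_likely,
    }

message = "My copay is $35 and I owe $1,250.00"
print(route(message, [{"label": "COPAY", "text": "$35", "score": 0.97}]))
```

Note that the quoted dollar pattern also absorbs a trailing comma (as in `$35, thanks`), so stripping trailing punctuation from matches is worth considering.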
|
| 167 |
+
|
| 168 |
+
This pattern is the same architecture recommended for the PHI detector. The fine-tune does the bulk-volume linguistic work; frontier APIs handle the long tail. At a 10 to 20% fallback rate, the API spend is roughly 5 to 10 times lower than running a frontier model on every message.
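
A back-of-the-envelope version of the cost claim, using only the per-1K figures from the benchmark table. It treats self-hosted inference as $0 marginal cost, as this card does, and ignores hosting and engineering costs.

```python
GPT4O_COST_PER_1K = 1.90      # USD per 1K inferences, from the benchmark table
FINE_TUNE_COST_PER_1K = 0.00  # self-hosted marginal cost as stated in this card

def blended_cost_per_1k(fallback_rate: float) -> float:
    """Cost per 1K messages when only `fallback_rate` of them reach GPT-4o."""
    return FINE_TUNE_COST_PER_1K + fallback_rate * GPT4O_COST_PER_1K

for rate in (0.10, 0.20, 1.00):
    cost = blended_cost_per_1k(rate)
    print(f"fallback {rate:.0%}: ${cost:.2f} per 1K messages")
```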
|
| 169 |
+
|
| 170 |
+
## How to use
|
| 171 |
+
|
| 172 |
+
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "raihan-js/clarioscope-insurance-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

text = ("Hi, I'd like to verify my coverage before tomorrow's appointment. "
        "My carrier is Aetna PPO, member ID AET-998-2210, group #4421. "
        "I'm the subscriber. My copay should be $35 - is that right?")

# Offsets map each token back to character positions in `text`
enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True, max_length=256)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()

id2label = model.config.id2label
spans = []
i = 0
# Merge each B- token with the I- tokens of the same type that follow it
while i < len(pred_ids):
    label = id2label[pred_ids[i]]
    if label.startswith("B-"):
        ent_type = label[2:]
        start = offsets[i][0]
        end = offsets[i][1]
        j = i + 1
        while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
            end = offsets[j][1]
            j += 1
        spans.append({"text": text[start:end], "label": ent_type})
        i = j
    else:
        i += 1

# Convert to structured JSON for downstream billing system
# (if a label occurs more than once, the last span wins)
extracted = {span["label"]: span["text"] for span in spans}
print(extracted)
# {'CARRIER': 'Aetna', 'PLAN_TYPE': 'PPO', 'MEMBER_ID': 'AET-998-2210',
#  'GROUP_NUMBER': '4421', 'RELATIONSHIP': 'subscriber', 'COPAY': '$35'}
```
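
The dictionary comprehension in the example keeps only the last span per label. If a message can mention the same field more than once (two copay amounts, say), one possible alternative is to collect every value per label. The helper name `spans_to_record` is illustrative and not part of the model's API.

```python
from collections import defaultdict

def spans_to_record(spans):
    """Group extracted spans by label, keeping each distinct value in order."""
    record = defaultdict(list)
    for span in spans:
        value = span["text"].strip()
        if not value:
            continue  # skip empty spans, e.g. from special tokens
        values = record[span["label"]]
        if value not in values:
            values.append(value)
    return dict(record)

example_spans = [
    {"label": "CARRIER", "text": "Aetna"},
    {"label": "COPAY", "text": "$35"},
    {"label": "COPAY", "text": "$50"},
    {"label": "CARRIER", "text": "Aetna"},
    {"label": "PLAN_TYPE", "text": ""},
]
print(spans_to_record(example_spans))
```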
|
| 214 |
+
|
| 215 |
+
## Limitations
|
| 216 |
+
|
| 217 |
+
- **All training and evaluation data is synthetic.** No real patient billing data was used. Production deployment should include calibration against real-world inbound messages.
|
| 218 |
+
- **`AUTH_NUMBER` is materially weaker than frontier** (0.30 vs 0.99). Do not rely on this model alone for prior-auth extraction.
|
| 219 |
+
- **Format brittleness on structured IDs.** When a member ID, group number, or claim reference uses a format outside the training distribution, the model often produces a span boundary off by one token, which is a miss under strict matching.
|
| 220 |
+
- **English only, healthcare practice domain only.**
|
| 221 |
+
- **`RELATIONSHIP` accuracy is moderate** (0.70). The category overlaps with other named entities and short-span detection is fragile in cross-generator evaluation.
|
| 222 |
+
- **Benchmark scope.** Only GPT-4o was benchmarked in this release; Anthropic API credit was exhausted before Haiku 4.5 / Sonnet 4.6 could be run.
|
| 223 |
+
|
| 224 |
+
## Intended use
|
| 225 |
+
|
| 226 |
+
A first-pass insurance / billing field extractor in a hybrid intake pipeline for healthcare practice software. Strongest on high-volume fields (`CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `SUBSCRIBER_NAME`). Should be paired with regex matchers (for dollar amounts and known ID formats) and a frontier-API fallback (for `AUTH_NUMBER` and `PLAN_TYPE` edge cases).
|
| 227 |
+
|
| 228 |
+
## Out-of-scope use
|
| 229 |
+
|
| 230 |
+
- **Sole reliance for prior-auth processing.** The model's `AUTH_NUMBER` F1 on test data is only 0.30.
|
| 231 |
+
- **Direct downstream billing without validation.** Any extracted member ID, group number, or claim reference should be validated against the carrier's verification endpoint before being relied on for payment routing.
|
| 232 |
+
- **Adversarial inputs.** Not hardened against prompt injection or adversarial text crafted to manipulate extraction.
|
| 233 |
+
|
| 234 |
+
## Citation
|
| 235 |
+
|
| 236 |
+
```bibtex
|
| 237 |
+
@misc{sikder2026clarioscope_insurance,
|
| 238 |
+
author = {Sikder, Akteruzzaman Raihan},
|
| 239 |
+
title = {ClarioScope insurance extractor v1: a 125M-parameter RoBERTa fine-tune for structured insurance and billing extraction},
|
| 240 |
+
year = {2026},
|
| 241 |
+
publisher = {HuggingFace},
|
| 242 |
+
howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-insurance-v1}},
|
| 243 |
+
}
|
| 244 |
+
```
|
| 245 |
+
|
| 246 |
+
## Read more
|
| 247 |
+
|
| 248 |
+
The full ClarioScope SLM Suite writeup (methodology, cost ledger, and per-model interpretation) is on dev.to. Each model has its own post; the suite-level summary is at the GitHub profile [github.com/raihan-js](https://github.com/raihan-js).
|
| 249 |
+
|
| 250 |
+
## Author
|
| 251 |
+
|
| 252 |
+
Built by [Akteruzzaman Raihan Sikder](https://huggingface.co/raihan-js), CTO, [ClarioScope AI](https://clarioscope.ai). Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor), a three-model intake intelligence pipeline.
|