raihan-js committed on
Commit
b343316
·
verified ·
1 Parent(s): 7aa25ad

Add README.md

Files changed (1)
  1. README.md +227 -40
README.md CHANGED
@@ -1,65 +1,252 @@
1
  ---
2
  library_name: transformers
3
  license: mit
 
 
4
  base_model: FacebookAI/roberta-base
 
5
  tags:
6
- - generated_from_trainer
 
 
 
 
 
 
 
 
7
  model-index:
8
  - name: clarioscope-insurance-v1
9
- results: []
 
 
 
 
 
 
 
 
 
 
 
 
 
10
  ---
11
 
12
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
13
- should probably proofread and complete it, then remove this comment. -->
14
 
15
- # clarioscope-insurance-v1
16
 
17
- This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) on the None dataset.
18
- It achieves the following results on the evaluation set:
19
- - Loss: 0.0312
20
- - Macro F1: 0.9571
21
- - Weighted F1: 0.9706
22
 
23
- ## Model description
 
 
 
 
 
 
 
 
 
 
 
 
 
24
 
25
- More information needed
26
 
27
- ## Intended uses & limitations
28
 
29
- More information needed
30
 
31
- ## Training and evaluation data
32
 
33
- More information needed
34
 
35
- ## Training procedure
36
 
37
- ### Training hyperparameters
 
 
38
 
39
- The following hyperparameters were used during training:
40
- - learning_rate: 1e-05
41
- - train_batch_size: 8
42
- - eval_batch_size: 16
43
- - seed: 42
44
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
45
- - lr_scheduler_type: cosine
46
- - lr_scheduler_warmup_steps: 0.1
47
- - num_epochs: 5
48
 
49
- ### Training results
 
 
 
 
 
 
 
 
 
 
 
 
 
50
 
51
- | Training Loss | Epoch | Step | Validation Loss | Macro F1 | Weighted F1 |
52
- |:-------------:|:-----:|:----:|:---------------:|:--------:|:-----------:|
53
- | 0.0304 | 1.0 | 1011 | 0.0343 | 0.9486 | 0.9624 |
54
- | 0.0147 | 2.0 | 2022 | 0.0345 | 0.9537 | 0.9682 |
55
- | 0.0223 | 3.0 | 3033 | 0.0312 | 0.9571 | 0.9706 |
56
- | 0.0082 | 4.0 | 4044 | 0.0325 | 0.9562 | 0.9704 |
57
- | 0.0152 | 5.0 | 5055 | 0.0328 | 0.9563 | 0.9706 |
58
 
 
59
 
60
- ### Framework versions
61
 
62
- - Transformers 5.8.0
63
- - Pytorch 2.11.0+cu130
64
- - Datasets 4.8.5
65
- - Tokenizers 0.22.2
1
  ---
2
  library_name: transformers
3
  license: mit
4
+ language:
5
+ - en
6
  base_model: FacebookAI/roberta-base
7
+ pipeline_tag: token-classification
8
  tags:
9
+ - token-classification
10
+ - ner
11
+ - insurance
12
+ - billing
13
+ - healthcare
14
+ - roberta
15
+ - clarioscope
16
+ metrics:
17
+ - f1
18
  model-index:
19
  - name: clarioscope-insurance-v1
20
+ results:
21
+ - task:
22
+ type: token-classification
23
+ name: Insurance / billing field extraction
24
+ dataset:
25
+ type: synthetic
26
+ name: clarioscope-insurance-suite
27
+ metrics:
28
+ - type: macro-f1
29
+ value: 0.7882
30
+ - type: weighted-f1
31
+ value: 0.8202
32
+ - type: latency_ms_per_example
33
+ value: 45.4
34
  ---
35
 
36
+ # ClarioScope insurance extractor v1
 
37
 
38
+ A 125M-parameter [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) fine-tune that extracts **structured insurance and billing fields** from inbound patient text. It tags spans across 12 field types (carrier, plan type, member ID, group number, claim ID, copay, deductible, billed amount, and more) and produces output that downstream billing systems can ingest as JSON. This is **model 3 of the [ClarioScope SLM Suite](https://huggingface.co/raihan-js)**, a three-model intake intelligence pipeline for healthcare practices.
39
 
40
+ ## TL;DR
 
 
 
 
41
 
42
+ | Property | Value |
43
+ |---|---|
44
+ | Task | Token classification over 12 insurance / billing field types (BIO tags, 25 labels) |
45
+ | Base model | [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) (125M params) |
46
+ | Training data | Synthetic examples generated via `gpt-4o-mini-2024-07-18`: 8,086 train / 897 val after cleanup |
47
+ | Test data | 672 synthetic examples generated by Claude Haiku 4.5 with a deliberately different prompt style |
48
+ | Val macro F1 | 95.71% |
49
+ | Val weighted F1 | 97.06% |
50
+ | **Test macro F1 vs GPT-4o** | **78.82%** vs gpt-4o-2024-11-20 95.62% |
51
+ | **Where fine-tune is closest to GPT-4o** | `SUBSCRIBER_NAME` (89 vs 91), `CLAIM_ID` (95 vs 100), `CARRIER` (90 vs 96), `MEMBER_ID` (91 vs 99) |
52
+ | **Where it loses badly** | `AUTH_NUMBER` (30 vs 99), `PLAN_TYPE` (69 vs 96): the two largest per-field gaps |
53
+ | **Test latency** | **45.4 ms/example on CPU**, about 26× faster than GPT-4o |
54
+ | **Per-inference cost** | **$0** self-hosted vs $1.90 per 1K for GPT-4o |
55
+ | License | MIT |
56
 
57
+ > **Note on benchmark scope.** This release benchmarks against GPT-4o only. Anthropic API credit was exhausted during the development cycle before the Claude Haiku 4.5 / Sonnet 4.6 comparisons could be run. A subsequent revision of this card will add the Anthropic numbers.
58
 
59
+ ![Macro F1 vs latency](https://huggingface.co/raihan-js/clarioscope-insurance-v1/resolve/main/f1_vs_latency.png)
60
 
61
+ ## Why this model exists
62
 
63
+ Patient inquiries about insurance, billing, and prior authorizations have a different shape from the rest of intake. The information value is in the structured fields (carrier name, member ID, group number, claim reference, copay amount), not in the surrounding prose. A practice's billing system can act on `{"carrier": "Aetna", "plan_type": "PPO", "member_id": "AET-998-2210"}` directly; it cannot act on "hi can you check if my Aetna PPO covers..." until those fields are extracted.
64
 
65
+ Frontier LLMs do this extraction well, with three consistent caveats: they cost real money per call, add ~1 second of latency, and send patient text to a third party. This model is the self-hosted alternative: 26× lower latency, $0 per inference after training, and patient text never leaves the host. It trails GPT-4o by 17 macro F1 points on aggregate, but it comes within 1 to 8 points of GPT-4o on `SUBSCRIBER_NAME`, `CLAIM_ID`, `CARRIER`, and `MEMBER_ID`, a group that includes the two highest-volume fields in the test set.
66
 
67
+ This is model 3 of the ClarioScope SLM Suite:
68
 
69
+ 1. **Intent classifier** ([`clarioscope-intent-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1)): what does this message want?
70
+ 2. **PHI detector** ([`clarioscope-phi-deberta-v1`](https://huggingface.co/raihan-js/clarioscope-phi-deberta-v1)): where is protected information in this message?
71
+ 3. **Insurance extractor** (this model): what billing-relevant structured data is in this message?
72
 
73
+ ## The 12 field types
 
 
 
 
 
 
 
 
74
 
75
+ | Label | What it captures |
76
+ |---|---|
77
+ | `CARRIER` | Insurance company name (Aetna, Blue Cross Blue Shield, UnitedHealthcare, Cigna, Medicare, Medicaid, Kaiser Permanente, Humana, Anthem) |
78
+ | `PLAN_TYPE` | Plan format or tier (PPO, HMO, EPO, HDHP, POS, Gold / Silver / Bronze / Platinum) |
79
+ | `MEMBER_ID` | The patient's subscriber / member ID on the insurance card |
80
+ | `GROUP_NUMBER` | Group / employer group number |
81
+ | `POLICY_NUMBER` | Distinct policy or contract number when separate from member ID |
82
+ | `SUBSCRIBER_NAME` | Name of the primary subscriber on the policy |
83
+ | `RELATIONSHIP` | Relationship to subscriber (self, spouse, child, dependent, parent) |
84
+ | `CLAIM_ID` | Claim reference number |
85
+ | `AUTH_NUMBER` | Prior authorization / pre-cert number |
86
+ | `COPAY` | Copay dollar amount (includes the `$`) |
87
+ | `DEDUCTIBLE` | Deductible dollar amount (includes the `$`) |
88
+ | `BILLED_AMOUNT` | Total amount billed, owed, or charged (includes the `$`) |
89
 
90
+ The model outputs BIO token labels (25 total: `O` + 12 × `{B-, I-}`), which downstream code converts back into character-offset spans and then into a JSON object.
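As a minimal sketch of how the 25-label inventory arises from the 12 field types, the snippet below builds a BIO label list. The ordering of labels here is an assumption for illustration; the authoritative mapping is whatever `model.config.id2label` contains in the published checkpoint.

```python
# Field names are taken from the table above; the ORDER is an assumption.
FIELD_TYPES = [
    "CARRIER", "PLAN_TYPE", "MEMBER_ID", "GROUP_NUMBER", "POLICY_NUMBER",
    "SUBSCRIBER_NAME", "RELATIONSHIP", "CLAIM_ID", "AUTH_NUMBER",
    "COPAY", "DEDUCTIBLE", "BILLED_AMOUNT",
]

# "O" for tokens outside any field, plus a B- (begin) and I- (inside) tag per field.
LABELS = ["O"] + [f"{prefix}-{field}" for field in FIELD_TYPES for prefix in ("B", "I")]

# Illustrative mappings only; read the real ones from model.config.
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(len(LABELS))
```

The count is 1 + 12 × 2 = 25, matching the size of the classification head.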
 
 
 
 
 
 
91
 
92
+ ## Architecture
93
 
94
+ Standard RoBERTa-base encoder with a token-classification head: a linear layer over each token's contextualized representation, producing 25 logits per token. All 125M parameters are fine-tuned. Training uses fp32 (not mixed precision; see the PHI detector's card for the NaN-gradient story that motivated this), `adamw_torch` optimizer, `max_grad_norm=1.0`, an explicit classifier-head re-init (`std=0.02` normal, zero bias), batch size 8, sequence length 256, learning rate 1e-5 with cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Five epochs run in ~10 minutes on a single RTX A4000.
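To make the "cosine schedule and 10% warmup" concrete, here is a small sketch of the usual linear-warmup-then-cosine-decay shape. It assumes 5,055 optimizer steps in total (5 epochs of 1,011 steps, as listed in the auto-generated training log) and rounds the warmup down to whole steps; the Trainer's own rounding may differ slightly, and `lr_at` is a hypothetical helper, not part of the training code.

```python
import math

def lr_at(step, base_lr=1e-5, total_steps=5055, warmup_frac=0.10):
    """Learning rate at a given optimizer step (illustrative helper)."""
    warmup_steps = int(total_steps * warmup_frac)  # assumption: round down
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 250, 505, 2780, 5055):
    print(step, f"{lr_at(step):.2e}")
```

The rate peaks at 1e-5 when warmup ends and reaches zero at the final step.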
95
 
96
+ ## Training data: synthetic, transparent about it
97
+
98
+ All training and evaluation data is **synthetic**. There is **no real patient billing data** in this model or its evaluation.
99
+
100
+ **Training set (8,086 examples after cleanup, from 8,120 originally).** Generated via the OpenAI API (`gpt-4o-mini-2024-07-18`, JSON-object response format, temperature 1.0) across 12 field types × healthcare practice types × channels. The generation prompt enforces a 40/40/20 realism mix (polished / casual / messy) and channel-specific scaling (SMS messiest, voicemail second messiest, email and web forms cleaner).
101
+
102
+ **Data cleanup.** Same cue-word noise pattern observed in the PHI detector: `gpt-4o-mini` sometimes returns entity texts that include the cue word (`"member ID AET-998-2210"` instead of `"AET-998-2210"`, `"copay $35"` instead of `"$35"`). `clean_data.py` in the repo strips these cue-word prefixes and re-locates the cleaned text in the source. 1,393 entities (~7.4% of all spans) were normalized during cleanup; 256 entities became empty after stripping and were dropped.
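The cleanup step can be pictured with the sketch below. It is not the repository's `clean_data.py`; the cue-word list, the `CUE_PREFIX` pattern, and the `clean_entity` helper are all hypothetical, written only to illustrate "strip the cue prefix, then re-locate the cleaned text in the source".

```python
import re

# Hypothetical cue-word prefixes; the real list in clean_data.py may differ.
CUE_PREFIX = re.compile(
    r"^(?:member\s+id|group\s+number|group|claim\s+id|copay|deductible)\s*[:#]?\s*",
    re.IGNORECASE,
)

def clean_entity(source, entity_text):
    """Strip a leading cue word and re-locate the remaining text in `source`.

    Returns None when nothing is left after stripping (the entity is dropped)
    or when the cleaned text cannot be found in the source.
    """
    stripped = CUE_PREFIX.sub("", entity_text, count=1).strip()
    if not stripped:
        return None
    start = source.find(stripped)  # simplification: first occurrence only
    if start < 0:
        return None
    return {"text": stripped, "start": start, "end": start + len(stripped)}

message = "my member ID AET-998-2210 is on file, copay $35"
print(clean_entity(message, "member ID AET-998-2210"))
print(clean_entity(message, "copay $35"))
print(clean_entity(message, "copay"))
```

Using `str.find` means a value that appears twice in a message is always mapped to its first occurrence, which a production cleanup script would need to handle more carefully.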
103
+
104
+ **Test set (672 examples).** Generated by Claude Haiku 4.5 with a deliberately different prompt style and abbreviation set (`w/`, `&`, `ins`, `mbr`, `grp` versus the train prompt's more formal style). This cross-generator split mitigates benchmark leakage from training the model on one generator's style.
105
+
106
+ **Anti-leakage validation.** The cross-generator split produces a large val/test gap (val macro F1 0.96; test macro F1 0.79). The val set comes from the same `gpt-4o-mini` distribution as training; the test set comes from a different generator. The 17-point gap is the cost of generalizing across generators, and a useful proxy for the gap that would exist between this model and a real-world distribution. Val numbers from same-generator splits routinely overstate real-world generalization on tasks like this.
107
+
108
+ ## Benchmark: entity-level F1, span boundaries matter
109
+
110
+ Evaluated on the 672-example held-out test set using [seqeval](https://github.com/chakki-works/seqeval), which requires both the entity type AND the exact span boundary to match for a true positive.
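The effect of strict matching can be seen in a small pure-Python sketch. This is not seqeval's implementation (for instance, it simply ignores an `I-` tag that has no preceding `B-`); `bio_to_spans` and `strict_f1` are hypothetical helpers that only illustrate why a boundary that is off by one token counts as both a false positive and a false negative.

```python
def bio_to_spans(tags):
    """Collect (type, start, end) spans from a BIO tag sequence (end exclusive)."""
    spans, start, ent = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and ent == tag[2:]
        if ent is not None and not continues:
            spans.append((ent, start, i))
            start, ent = None, None
        if tag.startswith("B-"):
            start, ent = i, tag[2:]
    return spans

def strict_f1(gold_tags, pred_tags):
    """Micro F1 where a hit requires the same type AND the same boundaries."""
    gold, pred = set(bio_to_spans(gold_tags)), set(bio_to_spans(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

gold = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "I-MEMBER_ID", "O", "B-COPAY"]
pred = ["O", "B-MEMBER_ID", "I-MEMBER_ID", "O", "O", "B-COPAY"]
print(strict_f1(gold, pred))
```

Here the predicted `MEMBER_ID` span stops one token early, so only the `COPAY` span counts as a true positive.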
111
+
112
+ | Model | Macro F1 | Weighted F1 | Latency / example | Cost / 1K inferences |
113
+ |---|---|---|---|---|
114
+ | **`raihan-js/clarioscope-insurance-v1` (CPU)** | **0.7882** | **0.8202** | **45.4 ms** | **$0.00** |
115
+ | `gpt-4o-2024-11-20` | 0.9562 | 0.9572 | 1202.3 ms | $1.90 |
116
+
117
+ ![F1 and cost comparison](https://huggingface.co/raihan-js/clarioscope-insurance-v1/resolve/main/f1_vs_cost.png)
118
+
119
+ GPT-4o wins on aggregate, but the per-entity breakdown shows the gap is concentrated in a few fields, led by `AUTH_NUMBER` and `PLAN_TYPE`.
120
+
121
+ ## Per-field breakdown
122
+
123
+ ![Per-field F1: fine-tuned vs GPT-4o](https://huggingface.co/raihan-js/clarioscope-insurance-v1/resolve/main/per_entity_f1.png)
124
+
125
+ **High-volume fields where fine-tune is competitive:**
126
+
127
+ | Field | Fine-tune | GPT-4o | Test support |
128
+ |---|---|---|---|
129
+ | `CLAIM_ID` | 0.954 | 0.997 | 146 |
130
+ | `MEMBER_ID` | 0.914 | 0.990 | 431 |
131
+ | `CARRIER` | 0.905 | 0.964 | 685 |
132
+ | `SUBSCRIBER_NAME` | 0.894 | 0.906 | 196 |
133
+ | `COPAY` | 0.864 | 0.954 | 179 |
134
+ | `BILLED_AMOUNT` | 0.847 | 0.971 | 219 |
135
+
136
+ `SUBSCRIBER_NAME` is essentially tied with GPT-4o (0.89 vs 0.91). `CARRIER`, `MEMBER_ID`, `CLAIM_ID`, and `BILLED_AMOUNT` together account for about 49% of the 3,040 test entities (1,481 by the support counts in these tables), and the fine-tune is within 4 to 13 points of GPT-4o on each.
137
+
138
+ **Mid-volume fields with moderate gaps:**
139
+
140
+ | Field | Fine-tune | GPT-4o | Test support |
141
+ |---|---|---|---|
142
+ | `POLICY_NUMBER` | 0.827 | 0.985 | 97 |
143
+ | `GROUP_NUMBER` | 0.804 | 0.995 | 216 |
144
+ | `DEDUCTIBLE` | 0.758 | 0.964 | 190 |
145
+
146
+ `GROUP_NUMBER` has a 19-point gap despite a test support of 216. The format variance for group numbers is wide (`4421`, `001428`, `GRP-882044`, `Group #44210`) and the cross-generator split exposes the train-set format bias.
147
+
148
+ **Lowest-F1 fields:**
149
+
150
+ | Field | Fine-tune | GPT-4o | Test support |
151
+ |---|---|---|---|
152
+ | `RELATIONSHIP` | 0.703 | 0.802 | 207 |
153
+ | `PLAN_TYPE` | 0.688 | 0.956 | 359 |
154
+ | `AUTH_NUMBER` | 0.300 | 0.991 | 115 |
155
+
156
+ `AUTH_NUMBER` is the headline weakness: 0.30 vs 0.99. The training set has only 770 AUTH_NUMBER spans total, and the format space is wide (`PA-4421`, `auth #998-2210`, `AUTH998212`, etc.). A v2 with more AUTH_NUMBER coverage in training would likely close most of this gap.
157
+
158
+ `RELATIONSHIP` is a hard category for both models: the spans are short strings ("self", "spouse", "child"), often overlap with other entity contexts, and have a tight span boundary that's easy to miss by one token.
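As a consistency check, the aggregate scores can be recomputed from the per-field tables. The sketch below copies the fine-tune F1 and test support for each field from the three tables above and takes an unweighted mean (macro) and a support-weighted mean (weighted); the per-field values are rounded to three decimals, so agreement is only expected to about three decimals.

```python
# field: (fine-tune F1, test support), copied from the per-field tables
PER_FIELD = {
    "CLAIM_ID": (0.954, 146),
    "MEMBER_ID": (0.914, 431),
    "CARRIER": (0.905, 685),
    "SUBSCRIBER_NAME": (0.894, 196),
    "COPAY": (0.864, 179),
    "BILLED_AMOUNT": (0.847, 219),
    "POLICY_NUMBER": (0.827, 97),
    "GROUP_NUMBER": (0.804, 216),
    "DEDUCTIBLE": (0.758, 190),
    "RELATIONSHIP": (0.703, 207),
    "PLAN_TYPE": (0.688, 359),
    "AUTH_NUMBER": (0.300, 115),
}

total_support = sum(n for _, n in PER_FIELD.values())
macro_f1 = sum(f1 for f1, _ in PER_FIELD.values()) / len(PER_FIELD)
weighted_f1 = sum(f1 * n for f1, n in PER_FIELD.values()) / total_support

print(f"support={total_support} macro={macro_f1:.4f} weighted={weighted_f1:.4f}")
```

Both values land on the reported 0.7882 macro F1 and 0.8202 weighted F1. The macro score sits lower because `AUTH_NUMBER` counts as much as `CARRIER` there, despite having roughly one sixth of the support.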
159
+
160
+ ## Recommended production pattern: hybrid
161
+
162
+ For a billing pipeline that processes inbound patient messages:
163
+
164
+ 1. **Run this model first** on every message. Captures `CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `SUBSCRIBER_NAME` with near-frontier F1 at ~45 ms / message on CPU.
165
+ 2. **Add regex for high-value structured patterns**: dollar amounts (`\$[\d,]+(?:\.\d{2})?`) and member ID format checks specific to your top carriers (for example, Aetna and BCBS IDs follow distinct patterns).
166
+ 3. **Use GPT-4o as a fallback** for messages where the fine-tune is uncertain or for AUTH_NUMBER / PLAN_TYPE detection. The fallback should fire on ~10–20% of messages, not 100%.
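The regex and fallback steps above can be sketched as follows. The dollar pattern is the one quoted in step 2; everything else is an assumption for illustration: the `needs_fallback` helper, the `score` field (a per-span confidence the caller would have to compute, for example from softmax probabilities), and the 0.80 threshold are hypothetical and not part of the released code.

```python
import re

DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d{2})?")  # pattern quoted in step 2
FALLBACK_LABELS = {"AUTH_NUMBER", "PLAN_TYPE"}   # weakest fields per the benchmark

def find_dollar_amounts(text):
    """Return every dollar amount the regex finds, in order of appearance."""
    return DOLLAR_RE.findall(text)

def needs_fallback(spans, min_score=0.80):
    """Decide whether a message should also go to the frontier API.

    `spans` is a list of {"label": str, "score": float}; both the score field
    and the threshold are hypothetical and should be tuned on real traffic.
    """
    return any(
        s["label"] in FALLBACK_LABELS or s["score"] < min_score for s in spans
    )

print(find_dollar_amounts("Copay is $35 and I was billed $1,250.00 total."))
print(needs_fallback([{"label": "CARRIER", "score": 0.99}]))
print(needs_fallback([{"label": "AUTH_NUMBER", "score": 0.99}]))
```

This routing only fires when the model has already tagged a weak field. Messages where an authorization number is missed entirely would need a separate trigger, such as a keyword check.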
167
+
168
+ This pattern is the same architecture recommended for the PHI detector. The fine-tune does the bulk-volume linguistic work; frontier APIs handle the long tail. Together they cost an order of magnitude less than running frontier on every message.
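A rough worked example of the cost claim, under stated assumptions: the only API spend is the fallback traffic, the GPT-4o price is the $1.90 per 1K inferences from the benchmark table, and self-hosted inference is treated as $0 as in this card (hosting costs are ignored). `hybrid_cost_per_1k` is a hypothetical helper.

```python
GPT4O_COST_PER_1K = 1.90  # USD per 1K inferences, from the benchmark table

def hybrid_cost_per_1k(fallback_rate, api_cost_per_1k=GPT4O_COST_PER_1K):
    """API spend per 1K messages when only `fallback_rate` of them reach the API."""
    return fallback_rate * api_cost_per_1k

for rate in (0.10, 0.20):
    cost = hybrid_cost_per_1k(rate)
    ratio = GPT4O_COST_PER_1K / cost
    print(f"fallback {rate:.0%}: ${cost:.2f} per 1K, {ratio:.1f}x cheaper")
```

At a 10% fallback rate the API spend is about $0.19 per 1K messages, a tenfold reduction; at 20% it is about $0.38, a fivefold reduction.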
169
+
170
+ ## How to use
171
+
172
+ ```python
+ import torch
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+ model_id = "raihan-js/clarioscope-insurance-v1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForTokenClassification.from_pretrained(model_id)
+ model.eval()
+
+ text = (
+     "Hi, I'd like to verify my coverage before tomorrow's appointment. "
+     "My carrier is Aetna PPO, member ID AET-998-2210, group #4421. "
+     "I'm the subscriber. My copay should be $35, is that right?"
+ )
+
+ # Tokenize and keep character offsets so predictions map back to the text.
+ enc = tokenizer(
+     text,
+     return_offsets_mapping=True,
+     return_tensors="pt",
+     truncation=True,
+     max_length=256,
+ )
+ offsets = enc.pop("offset_mapping")[0].tolist()
+ with torch.no_grad():
+     pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()
+
+ # Merge each B- token with the I- tokens of the same type that follow it.
+ id2label = model.config.id2label
+ spans = []
+ i = 0
+ while i < len(pred_ids):
+     label = id2label[pred_ids[i]]
+     if label.startswith("B-"):
+         ent_type = label[2:]
+         start, end = offsets[i]
+         j = i + 1
+         while j < len(pred_ids) and id2label[pred_ids[j]] == f"I-{ent_type}":
+             end = offsets[j][1]
+             j += 1
+         spans.append({"text": text[start:end], "label": ent_type})
+         i = j
+     else:
+         i += 1
+
+ # Convert to structured JSON for the downstream billing system.
+ # Note: if a label occurs more than once, only the last span is kept.
+ extracted = {span["label"]: span["text"] for span in spans}
+ print(extracted)
+ # Example output:
+ # {'CARRIER': 'Aetna', 'PLAN_TYPE': 'PPO', 'MEMBER_ID': 'AET-998-2210',
+ #  'GROUP_NUMBER': '4421', 'RELATIONSHIP': 'subscriber', 'COPAY': '$35'}
+ ```
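A message can mention the same field more than once (two copays, two claim IDs), and a flat `{label: text}` dict keeps only one value per label. A minimal alternative, assuming the same `spans` structure as in the snippet above, is to collect values into lists; `spans_to_record` is a hypothetical helper name.

```python
from collections import defaultdict

def spans_to_record(spans):
    """Group span texts by label, keeping every distinct value in order."""
    record = defaultdict(list)
    for span in spans:
        values = record[span["label"]]
        if span["text"] not in values:  # drop exact repeats of the same value
            values.append(span["text"])
    return dict(record)

example_spans = [
    {"text": "Aetna", "label": "CARRIER"},
    {"text": "$35", "label": "COPAY"},
    {"text": "$50", "label": "COPAY"},
    {"text": "Aetna", "label": "CARRIER"},
]
print(spans_to_record(example_spans))
```

Downstream code can then decide how to resolve conflicts, for example by flagging any label with more than one value for review.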
214
+
215
+ ## Limitations
216
+
217
+ - **All training and evaluation data is synthetic.** No real patient billing data was used. Production deployment should include calibration against real-world inbound messages.
218
+ - **`AUTH_NUMBER` is materially weaker than frontier** (0.30 vs 0.99). Do not rely on this model alone for prior-auth extraction.
219
+ - **Format brittleness on structured IDs.** When a member ID, group number, or claim reference uses a format outside the training distribution, the model often produces a span boundary off by one token, which is a miss under strict matching.
220
+ - **English only, healthcare practice domain only.**
221
+ - **`RELATIONSHIP` accuracy is moderate** (0.70). The category overlaps with other named entities and short-span detection is fragile in cross-generator evaluation.
222
+ - **Benchmark scope.** Only GPT-4o was benchmarked in this release; Anthropic API credit was exhausted before Haiku 4.5 / Sonnet 4.6 could be run.
223
+
224
+ ## Intended use
225
+
226
+ A first-pass insurance / billing field extractor in a hybrid intake pipeline for healthcare practice software. Strongest on high-volume fields (`CARRIER`, `MEMBER_ID`, `CLAIM_ID`, `SUBSCRIBER_NAME`). Should be paired with regex matchers (for dollar amounts and known ID formats) and a frontier-API fallback (for `AUTH_NUMBER` and `PLAN_TYPE` edge cases).
227
+
228
+ ## Out-of-scope use
229
+
230
+ - **Sole reliance for prior-auth processing.** `AUTH_NUMBER` F1 is only 0.30 on the test set, so most prior-auth numbers are missed or extracted with the wrong span boundary.
231
+ - **Direct downstream billing without validation.** Any extracted member ID, group number, or claim reference should be validated against the carrier's verification endpoint before being relied on for payment routing.
232
+ - **Adversarial inputs.** Not hardened against prompt injection or adversarial text crafted to manipulate extraction.
233
+
234
+ ## Citation
235
+
236
+ ```bibtex
237
+ @misc{sikder2026clarioscope_insurance,
238
+ author = {Sikder, Akteruzzaman Raihan},
239
+ title = {ClarioScope insurance extractor v1: a 125M-parameter RoBERTa fine-tune for structured insurance and billing extraction},
240
+ year = {2026},
241
+ publisher = {HuggingFace},
242
+ howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-insurance-v1}},
243
+ }
244
+ ```
245
+
246
+ ## Read more
247
+
248
+ The full ClarioScope SLM Suite writeup (methodology, cost ledger, and per-model interpretation) is on dev.to. Each model has its own post; the suite-level summary is at the GitHub profile [github.com/raihan-js](https://github.com/raihan-js).
249
+
250
+ ## Author
251
+
252
+ Built by [Akteruzzaman Raihan Sikder](https://huggingface.co/raihan-js), CTO, [ClarioScope AI](https://clarioscope.ai). Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor), a three-model intake intelligence pipeline.