library_name: transformers
license: mit
language:
  - en
base_model: microsoft/deberta-v3-base
pipeline_tag: text-classification
tags:
  - text-classification
  - intent-classification
  - healthcare
  - deberta
  - clarioscope
metrics:
  - accuracy
  - f1
model-index:
  - name: clarioscope-intent-deberta-v1
    results:
      - task:
          type: text-classification
          name: Patient inquiry intent classification
        dataset:
          type: synthetic
          name: clarioscope-intent-suite
        metrics:
          - type: accuracy
            value: 0.9116
          - type: macro-f1
            value: 0.9107
          - type: latency_ms_per_example
            value: 48.5

ClarioScope intent classifier v1

A 184M-parameter DeBERTa-v3-base fine-tune that classifies inbound patient text from healthcare practices into one of seven intent categories. On the synthetic test set it comes within roughly 4 percentage points of Claude Haiku 4.5 and GPT-4o on accuracy, at about 22× lower latency and effectively $0 marginal cost per inference once deployed. This is the first model in the ClarioScope SLM Suite, a three-model intake intelligence pipeline for healthcare practices.

TL;DR

Property	Value
Task	Seven-class intent classification of patient inquiries
Base model	microsoft/deberta-v3-base (184M params)
Training data	About 9,000 synthetic examples (8,099 train + 899 val = 8,998) generated via `gpt-4o-mini-2024-07-18`
Test data	1,154 synthetic examples generated by Claude with a deliberately different prompt style (to limit train/test leakage)
Val accuracy	88.77%
Val macro F1	88.60%
Test accuracy vs frontier	91.16% — within 4.2 pp of Claude Haiku 4.5 (95.32%)
Test latency	48.5 ms/example on CPU — 22× faster than frontier API calls
Per-inference cost	$0 self-hosted vs $0.25–$0.76 per 1K for frontier APIs
License	MIT

Why this model exists

Healthcare practices receive a flood of inbound patient text every day across email, contact forms, chat widgets, SMS, and voicemail transcripts. Routing this text correctly — to scheduling, billing, clinical, or front desk — is a high-volume, latency-sensitive, deterministic task. Frontier LLMs solve it well but are overkill: they cost real money per call, add roughly one second of network round-trip per message, and send patient text to a third party.

This classifier is a small, fast, deterministic, on-premises alternative. It runs in about 48.5 ms per example on a CPU, costs nothing per inference after the one-time training run, and keeps patient text inside your infrastructure. The accuracy gap versus frontier APIs is real but small (about 4 percentage points), small enough that for production routing the speed, cost, and privacy wins dominate.

This is model 1 of the ClarioScope SLM Suite:

Intent classifier (this model) — what does this message want?
PHI detector (in development) — what protected health information needs redaction?
Insurance extractor (in development) — what billing-relevant structured data is in this message?

Each model is small, specialized, and faster than frontier APIs on its narrow task. The three compose into an end-to-end intake pipeline.

Intent categories

Label	What it captures
`new_patient_inquiry`	A prospective patient asking about becoming a patient
`existing_patient_question`	An existing patient with a non-urgent question
`appointment_request`	Scheduling, rescheduling, or cancellation
`billing_inquiry`	Questions about bills, insurance, or pricing of services already received
`clinical_concern`	An active medical concern requiring clinical attention
`complaint`	Dissatisfaction with service, staff, communication, or outcome
`price_shopper`	Pricing-only inquiry, no commitment signals

Disambiguation rules used during data generation: `clinical_concern` vs `complaint` is resolved in favor of `complaint` when both signals are present (the dissatisfaction signal dominates routing). `billing_inquiry` is reserved for services already received; pre-commitment pricing questions are `price_shopper`. `appointment_request` covers anything scheduling-related from established patients; new-patient appointment requests stay under `new_patient_inquiry`.
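The three rules above can be written down as a small precedence function. This is only an illustrative sketch: the signal names (`dissatisfaction`, `clinical`, `pricing`, `scheduling`), the two boolean flags, and the function `resolve_label` are hypothetical, and the ordering among signals beyond the three stated rules is an assumption, not something the card specifies.

```python
def resolve_label(signals: set[str], established_patient: bool, service_received: bool) -> str:
    """Pick one intent label from hypothetical message signals.

    Only three rules come from the card; the remaining precedence is assumed.
    """
    # Card rule: dissatisfaction dominates a co-occurring clinical signal.
    if "dissatisfaction" in signals:
        return "complaint"
    if "clinical" in signals:
        return "clinical_concern"
    # Card rule: billing_inquiry only for services already received;
    # pre-commitment pricing questions are price_shopper.
    if "pricing" in signals:
        return "billing_inquiry" if service_received else "price_shopper"
    # Card rule: scheduling from established patients is appointment_request;
    # new-patient appointment requests stay under new_patient_inquiry.
    if "scheduling" in signals:
        return "appointment_request" if established_patient else "new_patient_inquiry"
    # Assumed fallback when no stronger signal is present.
    return "existing_patient_question" if established_patient else "new_patient_inquiry"


print(resolve_label({"dissatisfaction", "clinical"}, True, True))  # complaint
print(resolve_label({"pricing"}, False, False))                    # price_shopper
print(resolve_label({"scheduling"}, False, False))                 # new_patient_inquiry
```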

Architecture

Standard DeBERTa-v3-base encoder with a sequence classification head: a single linear layer over the pooled [CLS] representation producing 7 output logits. All 184M parameters are fine-tuned; LoRA was not used (full fine-tuning at this dataset size produces the best results without parameter-efficient overhead). Training used fp16 mixed precision on a single RTX 4090, batch size 32, max sequence length 256 tokens, learning rate 2e-5 with a cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Training reached 5 epochs in roughly 5 minutes.
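To make the learning-rate schedule concrete, here is a dependency-free sketch of linear warmup followed by cosine decay using the hyperparameters quoted above. It assumes the schedule spans exactly 5 epochs over the 8,099 training examples, with no gradient accumulation and the last partial batch kept; the card does not state these details, and `lr_at` is an illustrative name.

```python
import math

TRAIN_EXAMPLES = 8_099   # train split size from the card
BATCH_SIZE = 32
EPOCHS = 5               # assumed schedule length
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.10

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)  # 254 1270 127
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

The learning rate starts at zero, peaks at 2e-5 after the warmup steps, and decays back toward zero by the final step.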

Training data (all synthetic)

All training and evaluation data is synthetic. There is no real patient data in this model or its evaluation. This is a deliberate choice to avoid HIPAA constraints for v1 and ship a fast first iteration. A v2 trained on real PHI would require HIPAA-eligible infrastructure (AWS SageMaker or Azure ML with a Business Associate Agreement); that is out of scope for v1.

Training set (about 9,000 examples: 8,099 train + 899 val = 8,998). Generated via the OpenAI API (`gpt-4o-mini-2024-07-18`, temperature 1.0, JSON-object response format) across 7 intents × 10 healthcare practice types × 5 channels × 8 tone modifiers × 5 age-profile modifiers. The generation prompt enforces a realism mix in every batch: roughly 40% polished, 40% casual (lowercase starts, contractions, fragments, missing punctuation), and 20% messy (typos, autocorrect mistakes, abbreviations like u/appt/tmrw, ALL CAPS for urgency, run-on phrasing). Messiness scales by channel: SMS is the messiest, voicemail transcripts are second, and email and web forms are more polished. The mix exists because real production patient text is not "ChatGPT polite," and a model trained only on clean text generalizes badly to messy input.
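As a rough illustration of how large that generation grid is, the sketch below enumerates the five axes and samples one generation spec with the 40/40/20 realism mix. The intent, practice-type, and channel names come from this card; the tone and age-profile names are placeholders because the card does not list them, and sampling the style per example (rather than per batch inside the prompt) is a simplification. Channel-specific mess scaling is not modeled.

```python
import itertools
import random

INTENTS = [
    "new_patient_inquiry", "existing_patient_question", "appointment_request",
    "billing_inquiry", "clinical_concern", "complaint", "price_shopper",
]
PRACTICE_TYPES = [
    "dental", "dermatology", "physical therapy", "ophthalmology", "orthodontics",
    "primary care", "chiropractic", "cosmetic surgery", "fertility",
    "mental health counseling",
]
CHANNELS = [
    "email body", "website contact form", "live chat widget message",
    "SMS message", "voicemail transcript",
]
TONES = [f"tone_{i}" for i in range(8)]                 # placeholder names
AGE_PROFILES = [f"age_profile_{i}" for i in range(5)]   # placeholder names

# Every combination of the five axes.
grid = list(itertools.product(INTENTS, PRACTICE_TYPES, CHANNELS, TONES, AGE_PROFILES))
print(len(grid))  # 14000 combinations, more than the ~9,000 examples generated

REALISM_MIX = {"polished": 0.4, "casual": 0.4, "messy": 0.2}


def sample_spec(rng: random.Random) -> dict:
    """Draw one generation spec: a grid cell plus a realism style."""
    intent, practice, channel, tone, age = rng.choice(grid)
    style = rng.choices(list(REALISM_MIX), weights=list(REALISM_MIX.values()))[0]
    return {"intent": intent, "practice": practice, "channel": channel,
            "tone": tone, "age_profile": age, "style": style}


print(sample_spec(random.Random(0)))
```

Since the grid has 14,000 cells and the dataset has about 9,000 examples, not every combination can appear in training.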

Test set (1,154 examples). Generated by Claude with a deliberately different prompt style and a different abbreviation set (w/, &, hrs, BTW, IDK, plz) from the training prompt. This cross-model split mitigates the failure mode where train and test come from the same generation distribution and the benchmark inflates. The fine-tuned model has to generalize across two distinct generation styles, and so does every benchmarked frontier model.

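Class distribution. Train+val: 1,276–1,290 examples per intent (balanced to within about 1%). The 899-example val split on its own is less even, with 105–153 examples per intent (see the support column in the per-class F1 table). Test: 160–169 examples per intent.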

Practice type and channel coverage. 10 practice types (dental, dermatology, physical therapy, ophthalmology, orthodontics, primary care, chiropractic, cosmetic surgery, fertility, mental health counseling), 5 channels (email body, website contact form, live chat widget message, SMS message, voicemail transcript).

Benchmark — fine-tuned vs frontier APIs

Evaluated on the held-out 1,154-example test set:

Model	Accuracy	Macro F1	Latency / example	Cost / 1K inferences
`raihan-js/clarioscope-intent-deberta-v1` (CPU)	91.16%	91.07%	48.5 ms	$0.00
`claude-haiku-4-5-20251001`	95.32%	95.28%	1064 ms	$0.252
`claude-sonnet-4-6`	93.59%	93.53%	1566 ms	$0.759
`gpt-4o-2024-11-20`	95.23%	95.17%	1036 ms	$0.527

Three observations worth highlighting:

Sonnet 4.6 is worse than Haiku 4.5. The larger frontier model scores lower on this task (93.59% vs 95.32% accuracy), and the gap shows up consistently rather than as a one-off artifact. One plausible explanation is that for narrow, well-structured classification, extra reasoning capacity sometimes second-guesses a correct first reading of a short message. The right tool for the job here is small and specific.

Latency advantage is on CPU. The 48.5 ms figure was measured on a CPU; on a modest GPU it drops another 5–10×. The frontier API latencies are network-bound, so faster local hardware does nothing for them. The wall-clock floor for a hosted API call from a Bangladesh ISP is ~700 ms even before the model processes anything.

Cost gap doesn't shrink at scale. Frontier-API cost scales linearly with call volume. The fine-tuned model has a one-time training cost (~$1.20 of OpenAI credit + ~$1.20 of RunPod compute, total under $3) and approximately zero marginal per-inference cost. At the per-1K prices in the table, a practice receiving 10,000 inbound messages per day (3.65 million per year) would pay roughly $2,770/year on Claude Sonnet, $1,920/year on GPT-4o, $920/year on Haiku, or roughly $0 on this model. At about 10 million messages per year (roughly 27,000 per day), the same prices give about $7,600, $5,300, and $2,500.
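The annual figures are the table's per-1K prices multiplied by yearly volume. A minimal sketch of that arithmetic, assuming a flat daily volume over 365 days (the helper name `annual_cost` is illustrative), alongside the latency ratios implied by the same table:

```python
# Per-1K-inference prices and per-example latencies copied from the benchmark table.
COST_PER_1K = {
    "claude-sonnet-4-6": 0.759,
    "gpt-4o-2024-11-20": 0.527,
    "claude-haiku-4-5-20251001": 0.252,
    "clarioscope-intent-deberta-v1": 0.0,
}
LATENCY_MS = {
    "claude-sonnet-4-6": 1566,
    "gpt-4o-2024-11-20": 1036,
    "claude-haiku-4-5-20251001": 1064,
    "clarioscope-intent-deberta-v1": 48.5,
}


def annual_cost(cost_per_1k: float, messages_per_day: float, days: int = 365) -> float:
    """Yearly API spend in dollars for a flat daily message volume."""
    return cost_per_1k * messages_per_day * days / 1000


for name, price in COST_PER_1K.items():
    at_10k_per_day = annual_cost(price, 10_000)            # 3.65M messages/year
    at_10m_per_year = annual_cost(price, 10_000_000 / 365) # ~27.4K messages/day
    speedup = LATENCY_MS[name] / LATENCY_MS["clarioscope-intent-deberta-v1"]
    print(f"{name}: ${at_10k_per_day:,.0f}/yr at 10K/day, "
          f"${at_10m_per_year:,.0f}/yr at 10M/yr, {speedup:.1f}x the local latency")
```

At 10,000 messages per day this gives about $2,770, $1,924, and $920 per year for Sonnet, GPT-4o, and Haiku respectively; at 10 million messages per year it gives about $7,590, $5,270, and $2,520.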

Per-class F1 (val set, 899 examples)

Intent	F1	Support
`price_shopper`	0.957	113
`complaint`	0.929	153
`billing_inquiry`	0.908	126
`appointment_request`	0.881	118
`clinical_concern`	0.874	143
`existing_patient_question`	0.834	141
`new_patient_inquiry`	0.819	105
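Macro F1 is the unweighted mean of the per-class F1 scores, so the headline validation figure can be cross-checked from the table. A short sketch using the rounded values above (the result matches the reported 88.60% only up to rounding):

```python
# (F1, support) per intent, copied from the per-class table above.
PER_CLASS = {
    "price_shopper": (0.957, 113),
    "complaint": (0.929, 153),
    "billing_inquiry": (0.908, 126),
    "appointment_request": (0.881, 118),
    "clinical_concern": (0.874, 143),
    "existing_patient_question": (0.834, 141),
    "new_patient_inquiry": (0.819, 105),
}

# Macro F1 treats every class equally, regardless of how many examples it has.
macro_f1 = sum(f1 for f1, _ in PER_CLASS.values()) / len(PER_CLASS)
total_support = sum(n for _, n in PER_CLASS.values())

print(f"macro F1 = {macro_f1:.3f}")          # macro F1 = 0.886
print(f"total support = {total_support}")    # total support = 899
```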

The hardest pairs to disambiguate are:

new_patient_inquiry ↔ appointment_request — new patients asking for their first appointment are genuinely ambiguous, and the model lands on appointment more often than the label intends.
existing_patient_question ↔ clinical_concern — medical questions from established patients read as low-grade concerns to the model.
clinical_concern ↔ complaint — frustrated medical concerns combine both signals; the data-generation prompt resolves toward complaint, but the model occasionally goes the other way.

The same pairs also gave Haiku 4.5 trouble, which suggests genuine ambiguity in the task rather than a weakness specific to this fine-tune.
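One way to surface confusable pairs like these from your own evaluation run is to count misclassifications per unordered label pair. The sketch below is illustrative only: `top_confused_pairs` is a hypothetical helper, and the five toy predictions are made up for the example, not taken from the model's actual outputs.

```python
from collections import Counter


def top_confused_pairs(y_true, y_pred, k=3):
    """Return the k most frequent unordered (label, label) confusions with counts."""
    pairs = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth != pred:
            # frozenset makes A->B and B->A count toward the same pair.
            pairs[frozenset((truth, pred))] += 1
    return [(tuple(sorted(pair)), count) for pair, count in pairs.most_common(k)]


# Toy labels for illustration only.
y_true = ["new_patient_inquiry", "new_patient_inquiry", "existing_patient_question",
          "clinical_concern", "billing_inquiry"]
y_pred = ["appointment_request", "appointment_request", "clinical_concern",
          "complaint", "billing_inquiry"]

for pair, count in top_confused_pairs(y_true, y_pred):
    print(pair, count)
```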

How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hub.
model_id = "raihan-js/clarioscope-intent-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

texts = [
    "Hi, I'm new to the area and looking for a dermatologist. Are you accepting new patients?",
    "got a bill for $382 for my visit on 4/12 but my copay should only be $35 — what's the rest?",
    "my kid's fever is 103.2 and not coming down with tylenol. need advice now",
]

# Tokenize as one padded batch, truncated to the 256-token training length.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for inference
    logits = model(**inputs).logits
# Map each row's highest-scoring logit to its label name.
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
print(labels)
# ['new_patient_inquiry', 'billing_inquiry', 'clinical_concern']
```

For production routing, consider also returning the maximum softmax probability as a confidence score and routing low-confidence predictions to a human reviewer.
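A minimal sketch of that confidence gate, written without dependencies so the logic is easy to follow. The 0.70 threshold, the `route` helper, and the label ordering in `ID2LABEL` are all assumptions for illustration; in practice, read the mapping from `model.config.id2label`, compute probabilities with `torch.softmax(logits, dim=-1)`, and tune the threshold on your own data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, id2label, threshold=0.70):
    """Return (label, confidence, needs_human_review) for one example's logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return id2label[best], confidence, confidence < threshold


# Hypothetical ordering; the real mapping lives in model.config.id2label.
ID2LABEL = dict(enumerate([
    "new_patient_inquiry", "existing_patient_question", "appointment_request",
    "billing_inquiry", "clinical_concern", "complaint", "price_shopper",
]))

print(route([0.1, 0.2, 4.0, 0.3, 0.1, 0.0, 0.2], ID2LABEL))  # confident, no review
print(route([0.0] * 7, ID2LABEL))                            # flat logits, flagged for review
```

Maximum softmax probability is a simple confidence proxy and is not guaranteed to be well calibrated, so the threshold is worth checking against labeled examples before it gates real routing.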

Limitations

All training and evaluation data is synthetic. While the realism mix in the generation prompt aims to approximate real production text, this model has not been evaluated on real patient inquiries from a live practice. Production deployment should include a calibration check against a sample of real inbound text before relying on the model for routing decisions.
English only. The training data is English; performance on other languages is undefined.
Healthcare practice domain only. The model is fine-tuned for inbound patient text at healthcare practices and has no signal for inquiries from other domains.
Seven categories, not exhaustive. The label set is opinionated. Messages that don't fit a category will get the closest available label rather than an "unknown" bucket. Threshold on the maximum softmax probability and defer low-confidence predictions to a human if your routing needs that.
No PHI redaction is performed by this model. This is a classification model. PHI detection / redaction is a separate model in the suite (clarioscope-phi-deberta-v1, in development).
Synthetic-data style bias. Despite using two different generation models for train and test, both are LLM-generated and may share systematic biases (overconfident messages, well-formed scenarios) that don't fully match real-world distribution tails.
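The calibration check mentioned in the first limitation could take several forms. One common option is expected calibration error (ECE), computed on a hand-labeled sample of real messages: bucket predictions by confidence and compare each bucket's average confidence with its actual accuracy. The sketch below is an assumption about what such a check might look like, not the author's procedure; the function name and bin count are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece


# Toy example: ten predictions at 0.9 confidence, only half of them right.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(f"{overconfident:.2f}")  # 0.40
```

A value near zero means stated confidence tracks real accuracy; a large value on real inbound text would suggest the confidence threshold used for human review needs retuning.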

Intended use

Production routing of inbound patient text at healthcare practices — deciding which inbox a message lands in (scheduling vs billing vs clinical vs front desk), with a human reviewer in the loop for low-confidence predictions. Also useful as a baseline for healthcare-specific NLP research where a small specialized classifier is more appropriate than a frontier API.

Out-of-scope use

Diagnostic decisions. This is a routing classifier, not a clinical decision support tool. The clinical_concern label routes a message for human attention; it does not characterize the severity, urgency, or specifics of the medical issue.
Determining HIPAA compliance. This model classifies intent, not whether a message contains protected health information. HIPAA compliance is a regulatory determination that a model cannot make.
Adversarial inputs. The model has not been hardened against prompt injection or adversarial text crafted to manipulate routing.

Citation

@misc{sikder2026clarioscope_intent,
  author = {Sikder, Akteruzzaman Raihan},
  title  = {ClarioScope intent classifier v1: a 184M-parameter DeBERTa fine-tune for healthcare practice intent classification},
  year   = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1}},
}

A detailed methodology writeup, including the realism prompt design, the cross-model train/test split that guards against benchmark leakage, and the full cost ledger, is on dev.to under the title "Matching frontier LLMs at 22× lower latency."

Author

Built by Akteruzzaman Raihan Sikder — CTO, ClarioScope AI. Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor) — a three-model intake intelligence pipeline.
