---
library_name: transformers
license: mit
language:
- en
base_model: microsoft/deberta-v3-base
pipeline_tag: text-classification
tags:
- text-classification
- intent-classification
- healthcare
- deberta
- clarioscope
metrics:
- accuracy
- f1
model-index:
- name: clarioscope-intent-deberta-v1
  results:
  - task:
      type: text-classification
      name: Patient inquiry intent classification
    dataset:
      type: synthetic
      name: clarioscope-intent-suite
    metrics:
    - type: accuracy
      value: 0.9116
    - type: macro-f1
      value: 0.9107
    - type: latency_ms_per_example
      value: 48.5
---
# ClarioScope intent classifier v1

A 184M-parameter [DeBERTa-v3-base](https://huggingface.co/microsoft/deberta-v3-base) fine-tune that classifies inbound patient text from healthcare practices into one of seven intent categories. On the held-out test set it comes within roughly 4 percentage points of the accuracy of Claude Haiku 4.5 and GPT-4o, at **22× lower latency** and **effectively $0 per inference** after deployment. This is the first model in the **ClarioScope SLM Suite**, a three-model intake intelligence pipeline for healthcare practices.
## TL;DR

| Property | Value |
|---|---|
| Task | Seven-class intent classification of patient inquiries |
| Base model | [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base) (184M params) |
| Training data | About 9,000 synthetic examples (8,099 train + 899 val = 8,998) generated via `gpt-4o-mini-2024-07-18` |
| Test data | 1,154 synthetic examples generated by Claude with a deliberately different prompt style (to limit train/test leakage) |
| Val accuracy | 88.77% |
| Val macro F1 | 88.60% |
| **Test accuracy vs frontier** | **91.16%**, within 4.2 pp of Claude Haiku 4.5 (95.32%) |
| **Test latency** | **48.5 ms/example on CPU**, about 22× faster than frontier API calls |
| **Per-inference cost** | **$0** self-hosted vs $0.25 to $0.76 per 1K for frontier APIs |
| License | MIT |

![Benchmark comparison](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1/resolve/main/assets/benchmark_comparison.png)
## Why this model exists

Healthcare practices receive a flood of inbound patient text every day across email, contact forms, chat widgets, SMS, and voicemail transcripts. Routing this text correctly (to scheduling, billing, clinical, or front desk) is a high-volume, latency-sensitive, deterministic task. Frontier LLMs solve it well but are overkill: they cost real money per call, add roughly one second of network round-trip per message, and send patient text to a third party.

This classifier is a small, fast, deterministic, on-premises alternative. It runs in about 48 milliseconds per message on a CPU, costs nothing per inference after the one-time training run, and keeps patient text inside your infrastructure. The accuracy gap versus frontier APIs is real but small (~4 percentage points), small enough that for production routing the speed, cost, and privacy wins dominate.

This is model 1 of the ClarioScope SLM Suite:

1. **Intent classifier** (this model): what does this message want?
2. **PHI detector** (in development): what protected health information needs redaction?
3. **Insurance extractor** (in development): what billing-relevant structured data is in this message?

Each model is designed to be small, specialized, and faster than frontier APIs on its narrow task. The three compose into an end-to-end intake pipeline.
## Intent categories

| Label | What it captures |
|---|---|
| `new_patient_inquiry` | A prospective patient asking about becoming a patient |
| `existing_patient_question` | An existing patient with a non-urgent question |
| `appointment_request` | Scheduling, rescheduling, or cancellation |
| `billing_inquiry` | Questions about bills, insurance, or pricing of services already received |
| `clinical_concern` | An active medical concern requiring clinical attention |
| `complaint` | Dissatisfaction with service, staff, communication, or outcome |
| `price_shopper` | Pricing-only inquiry, no commitment signals |

Disambiguation rules used during data generation: `clinical_concern` vs `complaint` is resolved in favor of `complaint` when both signals are present (the dissatisfaction signal dominates routing). `billing_inquiry` is reserved for services *already received*; pre-commitment pricing questions are `price_shopper`. `appointment_request` covers anything scheduling-related from established patients; new-patient appointment requests stay under `new_patient_inquiry`.
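
The three tie-break rules above can be made explicit as a small function. This is only an illustrative sketch: the rules were applied during data generation, and the function name and the boolean flags (`has_dissatisfaction`, `service_already_received`, `established_patient`) are hypothetical names introduced here, not part of the model or its training code.

```python
def apply_disambiguation(label, *, has_dissatisfaction=False,
                         service_already_received=True,
                         established_patient=True):
    """Apply the three stated tie-break rules to a candidate label.

    All flag names are hypothetical; they stand in for signals a labeler
    (human or LLM) would read from the message.
    """
    # Rule 1: a medical concern that also voices dissatisfaction is a complaint.
    if label == "clinical_concern" and has_dissatisfaction:
        return "complaint"
    # Rule 2: pricing questions before any service was received are price shopping.
    if label == "billing_inquiry" and not service_already_received:
        return "price_shopper"
    # Rule 3: a first appointment for a new patient stays a new-patient inquiry.
    if label == "appointment_request" and not established_patient:
        return "new_patient_inquiry"
    return label


print(apply_disambiguation("clinical_concern", has_dissatisfaction=True))
print(apply_disambiguation("billing_inquiry", service_already_received=False))
print(apply_disambiguation("appointment_request", established_patient=False))
```

Conflicts the rules do not mention (for example `complaint` vs `billing_inquiry`) are left untouched by this sketch, since the card does not specify them.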
## Architecture

Standard DeBERTa-v3-base encoder with a sequence classification head: a single linear layer over the pooled `[CLS]` representation producing 7 output logits. All 184M parameters are fine-tuned. LoRA was not used, because full fine-tuning is cheap at this dataset size and avoids parameter-efficient overhead. Training used fp16 mixed precision on a single RTX 4090, batch size 32, max sequence length 256 tokens, learning rate 2e-5 with a cosine schedule and 10% warmup, weight decay 0.01, and early stopping (patience 2 on macro F1). Training ran for 5 epochs in roughly 5 minutes.
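
To make the schedule concrete, the sketch below works out the step counts implied by the stated hyperparameters and evaluates a warmup-plus-cosine learning rate at a few points. Several things here are assumptions rather than facts from the training run: no gradient accumulation, the last partial batch is kept, the schedule spans exactly 5 epochs, and the shape is the common "linear warmup, then cosine decay to zero" form.

```python
import math

# Values stated in this card
TRAIN_EXAMPLES = 8_099
BATCH_SIZE = 32
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.10

# Assumption: one optimizer step per batch, last partial batch kept
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = round(total_steps * WARMUP_FRACTION)


def lr_at(step):
    """Learning rate at an optimizer step (assumed schedule shape)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return PEAK_LR * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run has 254 steps per epoch, 1,270 steps in total, and 127 warmup steps, which is small enough to explain a training time measured in minutes on a single GPU.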
## Training data: be transparent about this

All training and evaluation data is **synthetic**. There is **no real patient data** in this model or its evaluation. This is a deliberate choice to avoid HIPAA constraints for v1 and ship a fast first iteration. A v2 trained on real PHI would require HIPAA-eligible infrastructure (AWS SageMaker or Azure ML with a Business Associate Agreement); that is out of scope for v1.

**Training set (about 9,000 examples: 8,099 train + 899 val).** Generated via the OpenAI API (`gpt-4o-mini-2024-07-18`, temperature 1.0, JSON-object response format) across 7 intents × 10 healthcare practice types × 5 channels × 8 tone modifiers × 5 age-profile modifiers. The generation prompt forces a mandatory realism mix per batch: roughly 40% polished, 40% casual (lowercase starts, contractions, fragments, missing punctuation), and 20% messy (typos, autocorrect mistakes, abbreviations like `u`/`appt`/`tmrw`, ALL CAPS for urgency, run-on phrasing). Mess is scaled by channel: SMS is the messiest, voicemail transcripts are second messiest, and email and web forms are more polished. This realism mix exists because real production patient text is not "ChatGPT polite," and a model trained only on clean text generalizes badly to the messy reality.

**Test set (1,154 examples).** Generated by Claude with a deliberately different prompt style and a different abbreviation set (`w/`, `&`, `hrs`, `BTW`, `IDK`, `plz`) from the training prompt. This cross-model split mitigates the failure mode where train and test come from the same generation distribution and the benchmark inflates. The fine-tuned model has to generalize across two distinct generation styles, and so does every benchmarked frontier model.

**Class distribution.** Train+val: 1,276 to 1,290 examples per intent (balanced to within about 1%). Test: 160 to 169 examples per intent.

**Practice type and channel coverage.** 10 practice types (dental, dermatology, physical therapy, ophthalmology, orthodontics, primary care, chiropractic, cosmetic surgery, fertility, mental health counseling), 5 channels (email body, website contact form, live chat widget message, SMS message, voicemail transcript).
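
The generation grid described above can be sketched as a Cartesian product with a realism style drawn per example. This is a hedged illustration, not the actual generation script: the intents, practice types, and channels come from this card, but the tone and age-profile names are placeholders (the card gives only their counts), and the card does not say how combinations were sampled. The sketch also draws the realism style per example and ignores the per-channel mess scaling.

```python
import itertools
import random

INTENTS = [
    "new_patient_inquiry", "existing_patient_question", "appointment_request",
    "billing_inquiry", "clinical_concern", "complaint", "price_shopper",
]
PRACTICE_TYPES = [
    "dental", "dermatology", "physical therapy", "ophthalmology", "orthodontics",
    "primary care", "chiropractic", "cosmetic surgery", "fertility",
    "mental health counseling",
]
CHANNELS = [
    "email body", "website contact form", "live chat widget message",
    "SMS message", "voicemail transcript",
]
# Placeholders: the card states 8 tone and 5 age-profile modifiers but not their names
TONES = [f"tone_{i}" for i in range(8)]
AGE_PROFILES = [f"age_profile_{i}" for i in range(5)]

grid = list(itertools.product(INTENTS, PRACTICE_TYPES, CHANNELS, TONES, AGE_PROFILES))
print(len(grid))  # 7 * 10 * 5 * 8 * 5 combinations

# Approximate realism mix stated in the card
REALISM_WEIGHTS = {"polished": 0.4, "casual": 0.4, "messy": 0.2}


def sample_spec(rng):
    """Draw one generation spec: a grid cell plus a realism style."""
    intent, practice, channel, tone, age = rng.choice(grid)
    style = rng.choices(list(REALISM_WEIGHTS),
                        weights=list(REALISM_WEIGHTS.values()), k=1)[0]
    return {"intent": intent, "practice": practice, "channel": channel,
            "tone": tone, "age_profile": age, "style": style}


print(sample_spec(random.Random(0)))
```

The grid has 14,000 cells, so a training set of roughly 9,000 examples cannot cover every combination once; how cells were chosen is not stated in this card.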
## Benchmark: fine-tuned vs frontier APIs

Evaluated on the held-out 1,154-example test set:

| Model | Accuracy | Macro F1 | Latency / example | Cost / 1K inferences |
|---|---|---|---|---|
| **`raihan-js/clarioscope-intent-deberta-v1` (CPU)** | **91.16%** | **91.07%** | **48.5 ms** | **$0.00** |
| `claude-haiku-4-5-20251001` | 95.32% | 95.28% | 1064 ms | $0.252 |
| `claude-sonnet-4-6` | 93.59% | 93.53% | 1566 ms | $0.759 |
| `gpt-4o-2024-11-20` | 95.23% | 95.17% | 1036 ms | $0.527 |

![Cost vs accuracy: the economic case for a fine-tuned small model](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1/resolve/main/assets/cost_accuracy_scatter.png)

Three observations worth highlighting:

**Sonnet 4.6 is worse than Haiku 4.5.** A bigger frontier model produces lower accuracy on this task. This isn't a one-off artifact; it shows up consistently. One plausible explanation is that, for narrow, well-structured classification, more reasoning capacity sometimes second-guesses correct intuitions on short prompts. The right tool for the job here is small and specific.

**Latency advantage is on CPU.** The 48 ms number was measured on a CPU; on a modest GPU it drops another 5 to 10×. The frontier API latencies are network-bound, so adding a GPU at the API side does nothing for you. The wall-clock floor for a hosted API call from a Bangladesh ISP is ~700 ms even before the model processes anything.

**Cost gap doesn't shrink at scale.** Frontier-API cost scales linearly with call volume. The fine-tuned model has a one-time training cost (~$1.20 of OpenAI credit + ~$1.20 of RunPod compute, total under $3) and approximately zero marginal per-inference cost. For a practice receiving 10,000 inbound messages per day (about 3.65 million per year), the per-1K costs in the table work out to roughly $2,770/year on Claude Sonnet 4.6, $1,920/year on GPT-4o, $920/year on Haiku 4.5, or roughly $0 on this model.
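
The annual figures follow directly from the per-1K costs in the benchmark table. The short calculation below reproduces them for 10,000 messages per day. The function name and the 365-day year are choices made for this sketch, and the self-hosted entry counts only marginal per-inference cost, not hardware or electricity.

```python
# Cost per 1K inferences, copied from the benchmark table
COST_PER_1K = {
    "claude-sonnet-4-6": 0.759,
    "gpt-4o-2024-11-20": 0.527,
    "claude-haiku-4-5-20251001": 0.252,
    "raihan-js/clarioscope-intent-deberta-v1": 0.0,
}


def annual_cost(cost_per_1k, messages_per_day, days_per_year=365):
    """Yearly API spend in dollars for a constant daily message volume."""
    return cost_per_1k * messages_per_day / 1000 * days_per_year


for model, cost in COST_PER_1K.items():
    print(f"{model}: ${annual_cost(cost, 10_000):,.0f}/year")
```

Because the cost is linear in volume, scaling to a different daily volume is a single multiplication: 10 million inferences, for instance, cost $7,590 on Claude Sonnet 4.6 at the table's rate.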
## Per-class F1 (val set, 899 examples)

| Intent | F1 | Support |
|---|---|---|
| `price_shopper` | 0.957 | 113 |
| `complaint` | 0.929 | 153 |
| `billing_inquiry` | 0.908 | 126 |
| `appointment_request` | 0.881 | 118 |
| `clinical_concern` | 0.874 | 143 |
| `existing_patient_question` | 0.834 | 141 |
| `new_patient_inquiry` | 0.819 | 105 |

The hardest pairs to disambiguate are:

- `new_patient_inquiry` ↔ `appointment_request`: new patients asking for their first appointment are genuinely ambiguous, and the model lands on appointment more often than the label intends.
- `existing_patient_question` ↔ `clinical_concern`: medical questions from established patients read as low-grade concerns to the model.
- `clinical_concern` ↔ `complaint`: frustrated medical concerns combine both signals; the data-generation prompt resolves toward complaint, but the model occasionally goes the other way.

These are the same pairs that gave Haiku 4.5 trouble too. They're real ambiguity in the task, not a classifier weakness specific to this fine-tune.

![Per-class F1 scores](https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1/resolve/main/assets/per_class_f1.png)
## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "raihan-js/clarioscope-intent-deberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

texts = [
    "Hi, I'm new to the area and looking for a dermatologist. Are you accepting new patients?",
    "got a bill for $382 for my visit on 4/12 but my copay should only be $35 - what's the rest?",
    "my kid's fever is 103.2 and not coming down with tylenol. need advice now",
]

# Tokenize as one padded batch; 256 tokens matches the training max length
inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():  # no gradients needed for inference
    logits = model(**inputs).logits

# Map each argmax class index back to its intent label
labels = [model.config.id2label[i] for i in logits.argmax(dim=-1).tolist()]
print(labels)
# ['new_patient_inquiry', 'billing_inquiry', 'clinical_concern']
```

For production routing, consider also returning the maximum softmax probability as a confidence score and routing low-confidence predictions to a human reviewer.
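
A minimal sketch of that confidence-based routing, written in plain Python so the logic is visible without the model loaded. The threshold of 0.7, the `human_review` queue name, the `route` function, and the label order in `DEMO_ID2LABEL` are all assumptions for illustration; in practice the mapping should come from `model.config.id2label` and the threshold should be tuned on real inbound text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, id2label, threshold=0.7):
    """Return (destination, confidence); defer to a human below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        return "human_review", confidence  # hypothetical fallback queue
    return id2label[best], confidence


# Hypothetical label order for the demo; use model.config.id2label in practice
DEMO_ID2LABEL = dict(enumerate([
    "new_patient_inquiry", "existing_patient_question", "appointment_request",
    "billing_inquiry", "clinical_concern", "complaint", "price_shopper",
]))

print(route([4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], DEMO_ID2LABEL))  # one clear winner
print(route([1.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0], DEMO_ID2LABEL))  # near tie
```

With a model loaded, the same function can be applied per row, for example `route(row.tolist(), model.config.id2label)` for each `row` in `logits`. Softmax confidence is not guaranteed to be well calibrated, so the threshold is best chosen against a labeled sample rather than fixed in advance.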
## Limitations

- **All training and evaluation data is synthetic.** While the realism mix in the generation prompt aims to approximate real production text, this model has not been evaluated on real patient inquiries from a live practice. Production deployment should include a calibration check against a sample of real inbound text before relying on the model for routing decisions.
- **English only.** The training data is English; performance on other languages is untested.
- **Healthcare practice domain only.** The model is fine-tuned for inbound patient text at healthcare practices and has no signal for inquiries from other domains.
- **Seven categories, not exhaustive.** The label set is opinionated. Messages that don't fit a category will get the closest available label rather than an "unknown" bucket. Threshold on the maximum softmax probability and defer low-confidence predictions to a human if your routing needs that.
- **No PHI redaction is performed by this model.** This is a classification model. PHI detection / redaction is a separate model in the suite (`clarioscope-phi-deberta-v1`, in development).
- **Synthetic-data style bias.** Despite using two different generation models for train and test, both are LLM-generated and may share systematic biases (overconfident messages, well-formed scenarios) that don't fully match real-world distribution tails.
## Intended use

Production routing of inbound patient text at healthcare practices: deciding which inbox a message lands in (scheduling vs billing vs clinical vs front desk), with a human reviewer in the loop for low-confidence predictions. Also useful as a baseline for healthcare-specific NLP research where a small specialized classifier is more appropriate than a frontier API.

## Out-of-scope use

- **Diagnostic decisions.** This is a routing classifier, not a clinical decision support tool. The `clinical_concern` label routes a message for human attention; it does not characterize the severity, urgency, or specifics of the medical issue.
- **Determining HIPAA compliance.** This model classifies *intent*, not whether a message contains protected health information. HIPAA compliance is a regulatory determination that a model cannot make.
- **Adversarial inputs.** The model has not been hardened against prompt injection or adversarial text crafted to manipulate routing.
## Citation

```bibtex
@misc{sikder2026clarioscope_intent,
  author = {Sikder, Akteruzzaman Raihan},
  title = {ClarioScope intent classifier v1: a 184M-parameter DeBERTa fine-tune for healthcare practice intent classification},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/raihan-js/clarioscope-intent-deberta-v1}},
}
```
## Read more

A detailed methodology writeup, including the realism prompt design, the cross-model train/test split that limits benchmark leakage, and the full cost ledger, is on dev.to: **[Matching frontier LLMs at 22× lower latency](https://dev.to/ryandevv/matching-frontier-llms-at-22x-lower-latency-a-184m-parameter-intent-classifier-for-healthcare-text-5ec2)**.

## Author

Built by [Akteruzzaman Raihan Sikder](https://huggingface.co/raihan-js), CTO, [ClarioScope AI](https://clarioscope.ai). Part of the broader ClarioScope SLM Suite (intent classifier, PHI detector, insurance extractor), a three-model intake intelligence pipeline.