---
language:
- en
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- clinical
- extraction
- medical
- qlora
- lora
- healthcare
- on-prem
datasets:
- synthea
pipeline_tag: text-generation
---

# Mira-1 — Clinical Extraction SLM

**Enterprise-grade clinical document extraction model.** Fine-tuned from Qwen2.5-3B-Instruct with QLoRA to extract structured JSON from clinical documents (lab reports, discharge summaries, medication lists, pathology reports, intake forms, progress notes).

## Key Features

- **Structured JSON output** — extracts patient demographics, vitals, labs, medications, diagnoses, procedures, allergies
- **Source-grounded** — every extracted value traces to the input document
- **No patient identifiers** — extracts age/sex only, strips names/MRN/DOB
- **On-prem deployable** — 3B parameters, runs on CPU via GGUF quantization
- **98% JSON validity** on a 50-example held-out gold set
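
The source-grounding and identifier rules above lend themselves to a mechanical spot check on each output. The sketch below is one way a reviewer might test them; it is not the method used to build Mira-1. `iter_leaves`, `ungrounded_values`, and `has_only_age_sex` are hypothetical helpers, and case-insensitive substring matching is an assumed, deliberately crude notion of "traces to the input document".

```python
def iter_leaves(node, path=""):
    """Yield (path, value) for every scalar leaf in a parsed JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    else:
        yield path, node


def ungrounded_values(extraction, source_text, skip=("document_type", "extraction_notes")):
    """Return paths of leaves whose text does not appear verbatim in the source.

    Assumption: fields listed in `skip` are labels or free-text notes written
    by the model, so they are not expected to be copied from the document.
    """
    haystack = source_text.lower()
    missing = []
    for path, value in iter_leaves(extraction):
        if value is None or isinstance(value, bool):
            continue
        top_level = path.split(".")[0].split("[")[0]
        if top_level in skip:
            continue
        if str(value).lower() not in haystack:
            missing.append(path)
    return missing


def has_only_age_sex(extraction):
    """True if the patient object carries no keys beyond age and sex."""
    return set(extraction.get("patient", {})) <= {"age", "sex"}
```

Any path returned by `ungrounded_values` is a candidate hallucination to route to the human reviewer. Normalised values (for example a reformatted date or a converted unit) will be flagged too, so treat the result as a prompt for review rather than a verdict.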

## Training

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Method | QLoRA (4-bit, r=16, alpha=32) |
| Training data | 3,438 examples (126 curated + 3,312 Synthea-rendered) |
| Epochs | 2 |
| Final loss | 0.14 |
| GPU | Kaggle T4 (free tier) |
| Training time | ~2h 40m |
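
For readers who want to reproduce a similar setup, the table maps onto `peft` and `transformers` configuration objects roughly as follows. This is a sketch, not the original training script: only the 4-bit setting, `r=16`, and `alpha=32` come from the table. The NF4 quantization type, double quantization, fp16 compute dtype, dropout, and target modules are assumptions. It requires `torch`, `transformers`, `peft`, and `bitsandbytes` to be installed.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA).
# Assumed: NF4 + double quantization; the table only states "4-bit".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs have no bfloat16 support
)

# Low-rank adapter settings.
lora_config = LoraConfig(
    r=16,               # from the table
    lora_alpha=32,      # from the table
    lora_dropout=0.05,  # assumed, not stated in the table
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```

`bnb_config` would be passed as `quantization_config` when loading the base model, and `lora_config` to `peft.get_peft_model`.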

## Evaluation (50 held-out gold examples)

| Metric | Value |
|--------|-------|
| JSON validity | 98% |
| Training loss | 1.23 → 0.14 |
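
The card does not spell out how JSON validity is computed. A minimal sketch, assuming an output counts as valid when it parses with `json.loads` and its top level is an object (`is_valid_json_object` and `json_validity` are hypothetical names):

```python
import json


def is_valid_json_object(text):
    """True if text parses as JSON and the top-level value is an object."""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def json_validity(outputs):
    """Fraction of model outputs that parse as a JSON object."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    return sum(is_valid_json_object(text) for text in outputs) / len(outputs)
```

Under this definition, an output wrapped in a Markdown code fence or followed by trailing commentary counts as invalid, which makes it a strict lower bound on usable outputs.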

## Usage

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained("shekharp77/Mira-1")
tokenizer = AutoTokenizer.from_pretrained("shekharp77/Mira-1")

messages = [
    {"role": "system", "content": "You are a clinical information extraction system..."},
    {"role": "user", "content": "Patient: 45/M\nHb 12.5 g/dL (13-17) LOW\nWBC 8.2 x10^9/L (4-11) Normal"},
]

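# Build the prompt with the chat template; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Greedy decoding for deterministic output.
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))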
```

## Schema

Outputs conform to this schema (10 required top-level fields):
- `document_type`: lab_report | medication_list | discharge_summary | pathology_report | intake_form | progress_note | other
- `patient`: {age, sex}
- `encounter`: {date, department}
- `vitals[]`, `labs[]`, `medications[]`, `diagnoses[]`, `procedures[]`, `allergies[]`
- `extraction_notes`
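
The top-level shape can be checked after parsing the model output. The sketch below covers only what is listed above: the ten required fields, the `document_type` values, and the six array fields. `schema_problems` is a hypothetical helper, and the nested item structure is left unchecked because the card does not define it.

```python
REQUIRED_FIELDS = (
    "document_type", "patient", "encounter", "vitals", "labs",
    "medications", "diagnoses", "procedures", "allergies", "extraction_notes",
)
DOCUMENT_TYPES = (
    "lab_report", "medication_list", "discharge_summary", "pathology_report",
    "intake_form", "progress_note", "other",
)
LIST_FIELDS = ("vitals", "labs", "medications", "diagnoses", "procedures", "allergies")


def schema_problems(record):
    """Return human-readable problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "document_type" in record and record["document_type"] not in DOCUMENT_TYPES:
        problems.append(f"unknown document_type: {record['document_type']!r}")
    for name in LIST_FIELDS:
        if name in record and not isinstance(record[name], list):
            problems.append(f"{name} must be a list")
    return problems
```

Calling `schema_problems(json.loads(model_output))` returns an empty list for a conforming record and otherwise gives the reviewer a short list of what to fix.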

## Limitations

- English only (v0)
- Trained on synthetic data (Synthea + curated seeds), not real clinical records
- Every output is a **draft for human review** — not for autonomous clinical decisions
- No ICD-10/SNOMED coding unless explicitly in the source document

## License

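Apache-2.0 for the adapter weights. The base model, Qwen/Qwen2.5-3B-Instruct, is released under the Qwen Research License rather than Apache-2.0, so its terms also apply to any deployment.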