---
language:
- en
license: apache-2.0
library_name: peft
base_model: Qwen/Qwen2.5-3B-Instruct
tags:
- clinical
- extraction
- medical
- qlora
- lora
- healthcare
- on-prem
datasets:
- synthea
pipeline_tag: text-generation
---

# Mira-1: Clinical Extraction SLM

**Enterprise-grade clinical document extraction model.** Fine-tuned from Qwen2.5-3B-Instruct with QLoRA to extract structured JSON from clinical documents (lab reports, discharge summaries, medication lists, pathology reports, intake forms, progress notes).

## Key Features

- **Structured JSON output**: extracts patient demographics, vitals, labs, medications, diagnoses, procedures, and allergies
- **Source-grounded**: every extracted value traces to the input document
- **No patient identifiers**: extracts age and sex only; names, MRN, and DOB are stripped
- **On-prem deployable**: 3B parameters, runs on CPU via GGUF quantization
- **98% JSON validity** on the 50-example held-out gold set

## Training

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Method | QLoRA (4-bit, r=16, alpha=32) |
| Training data | 3,438 examples (126 curated + 3,312 Synthea-rendered) |
| Epochs | 2 |
| Final training loss | 0.14 |
| GPU | Kaggle T4 (free tier) |
| Training time | ~2h 40m |

## Evaluation (50 held-out gold examples)

| Metric | Value |
|--------|-------|
| JSON validity | 98% |
| Training loss (start → end) | 1.23 → 0.14 |

## Usage

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the LoRA adapter on top of its base model, plus the matching tokenizer.
model = AutoPeftModelForCausalLM.from_pretrained("shekharp77/Mira-1")
tokenizer = AutoTokenizer.from_pretrained("shekharp77/Mira-1")

messages = [
    {"role": "system", "content": "You are a clinical information extraction system..."},
    {"role": "user", "content": "Patient: 45/M\nHb 12.5 g/dL (13-17) LOW\nWBC 8.2 x10^9/L (4-11) Normal"},
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
# Greedy decoding keeps the extraction deterministic.
outputs = model.generate(inputs, max_new_tokens=2048, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Schema

Outputs conform to this schema (10 required top-level fields):

- `document_type`: lab_report | medication_list | discharge_summary | pathology_report | intake_form | progress_note | other
- `patient`: {age, sex}
- `encounter`: {date, department}
- `vitals[]`, `labs[]`, `medications[]`, `diagnoses[]`, `procedures[]`, `allergies[]`
- `extraction_notes`

## Limitations

- English only (v0)
- Trained on synthetic data (Synthea + curated seeds), not real clinical records
- Every output is a **draft for human review** and must not drive autonomous clinical decisions
- No ICD-10/SNOMED coding unless the codes appear explicitly in the source document

## License

Apache-2.0 for this adapter. The base model, Qwen/Qwen2.5-3B-Instruct, is distributed under the Qwen Research License rather than Apache-2.0, so review its terms as well.
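
## Checking output against the schema

Because every output is a draft for human review, it can help to reject malformed responses before a reviewer sees them. The sketch below checks one response against the top-level fields listed in the Schema section. It is not part of the released model: `check_extraction` is a hypothetical helper, it assumes the response is a bare JSON object with no surrounding text, and it does not inspect the inner structure of list items, which this card does not specify.

```python
import json

# Top-level keys and document types copied from the Schema section of this card.
REQUIRED_FIELDS = (
    "document_type", "patient", "encounter", "vitals", "labs",
    "medications", "diagnoses", "procedures", "allergies", "extraction_notes",
)
LIST_FIELDS = ("vitals", "labs", "medications", "diagnoses", "procedures", "allergies")
DOCUMENT_TYPES = (
    "lab_report", "medication_list", "discharge_summary",
    "pathology_report", "intake_form", "progress_note", "other",
)


def check_extraction(raw_text):
    """Return a list of problems found in one response (empty if none)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if "document_type" in data and data["document_type"] not in DOCUMENT_TYPES:
        problems.append(f"unknown document_type: {data['document_type']!r}")
    for name in LIST_FIELDS:
        if name in data and not isinstance(data[name], list):
            problems.append(f"{name} should be a list")
    return problems


# Hand-written placeholder, not real model output.
sample = json.dumps({
    "document_type": "other",
    "patient": {"age": 45, "sex": "M"},
    "encounter": {"date": None, "department": None},
    "vitals": [], "labs": [], "medications": [],
    "diagnoses": [], "procedures": [], "allergies": [],
    "extraction_notes": "",
})
print(check_extraction(sample))  # []
```

A non-empty result lists each problem as a short string, which can be logged next to the source document for the reviewer.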