---
license: apache-2.0
base_model: unsloth/Qwen2.5-3B-Instruct
tags:
- medical
- question-answering
- discharge-summary
- clinical-nlp
- healthcare
- qlora
- unsloth
- ehr
- patient-safety
language:
- en
library_name: peft
pipeline_tag: text-generation
datasets:
- AmareshHebbar/discharge-qa-sft
co2_eq_emissions:
  emissions: 0
  source: "estimate, not measured with a carbon-tracking tool"
  training_type: "fine-tuning"
  geographical_location: "EU-West"
  hardware_used: "NVIDIA A6000 (48GB)"
model-index:
- name: discharge-qa-qwen25-3b
  results: []
---

# 💬 Discharge Summary QA

### Qwen2.5-3B fine-tuned for discharge summary QA

[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Model-discharge--qa--qwen25--3b-FFD21E)](https://huggingface.co/AmareshHebbar/discharge-qa-qwen25-3b)
[![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Dataset-discharge--qa--sft-blue)](https://huggingface.co/datasets/AmareshHebbar/discharge-qa-sft)
[![License](https://img.shields.io/badge/license-Apache%202.0-green)](https://www.apache.org/licenses/LICENSE-2.0)
[![Base Model](https://img.shields.io/badge/base-Qwen2.5--3B-orange)](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct)
[![Unsloth](https://img.shields.io/badge/-Unsloth-purple)](https://github.com/unslothai/unsloth)
[![W&B](https://img.shields.io/badge/W%26B-tracked-yellow?logo=weightsandbiases)](https://wandb.ai/amareshhebbar-/axiomapper/runs/PASTE_RUN_ID)

*Part of the [Medical AI Fine-tuned Model Suite](https://huggingface.co/AmareshHebbar/medical-ai-model-suite): 16 specialist models, one per task*


---

## TL;DR

Answers specific questions about a patient's hospitalization using only the information in their discharge summary.

```
INPUT:  DISCHARGE SUMMARY: [72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]\n\nQUESTION: What medications was the patient discharged on?
OUTPUT: The patient was discharged on three medications: 1) Furosemide 80mg once daily, 2) Carvedilol 12.5mg twice daily, 3) Sacubitril/Valsartan 24/26mg twice daily.
```

| | |
|---|---|
| **Base model** | [unsloth/Qwen2.5-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-3B-Instruct) |
| **Method** | QLoRA, 4-bit NF4, rank 16 |
| **Training data** | [discharge-qa-sft](https://huggingface.co/datasets/AmareshHebbar/discharge-qa-sft), 30,000 real-world rows |
| **Training compute** | NVIDIA A6000 (48GB), ~1.5h |
| **License** | Apache 2.0 |

---

## Architecture

```
                  +-------------------------+
user prompt  -->  |  Qwen2.5-3B-Instruct    |  --> base weights (frozen, 4-bit NF4)
                  |  + LoRA adapter (r=16)  |  --> discharge-qa-qwen25-3b
                  +-------------------------+
                              |
                              v
          answer text grounded in the discharge summary
```

This repo contains **only the LoRA adapter** (~60MB), not the full merged weights. Load it on top of the base model as shown in the Quickstart. This keeps the download small and lets you swap adapters on a single base model held in memory.

---

## Intended use

Let care teams or patients ask precise questions about a discharge summary instead of reading the entire document.

### Direct use

Provide a discharge summary plus a question, and get back an answer grounded in that document.

### Downstream use

Power a patient-facing portal Q&A widget, or a care-transition checklist generator for receiving facilities.

### Out of scope

Answering questions about information not contained in the provided summary. The model is not a general medical knowledge base and should not be asked open clinical questions unrelated to the document.

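The input layout used throughout this card (a `DISCHARGE SUMMARY:` block followed by a `QUESTION:` line, under the system prompt from the Quickstart section) can be assembled with a small helper. The sketch below is illustrative only: `build_messages` and `SYSTEM_PROMPT` are hypothetical names, not part of this repo, and the empty-input check is an added assumption reflecting the out-of-scope note above.

```python
# Hypothetical helper (not shipped with this repo): builds chat messages
# in the DISCHARGE SUMMARY / QUESTION layout shown in this card.
SYSTEM_PROMPT = (
    "You are a clinical QA assistant. Answer the question based on the "
    "discharge summary provided. Be specific and cite relevant details."
)


def build_messages(summary: str, question: str) -> list[dict[str, str]]:
    """Return system + user messages for one summary/question pair."""
    summary, question = summary.strip(), question.strip()
    # Assumption: refuse to build a prompt without a document, since the
    # model is meant to answer only from the supplied summary.
    if not summary or not question:
        raise ValueError("both a discharge summary and a question are required")
    user = f"DISCHARGE SUMMARY: {summary}\n\nQUESTION: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]


messages = build_messages(
    "[72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]",
    "What medications was the patient discharged on?",
)
print(messages[1]["content"])
```

The returned list can be passed directly to `tokenizer.apply_chat_template` or to an OpenAI-compatible chat endpoint.
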
> **This model is not a substitute for a certified medical professional's judgment.** Output should be reviewed by a qualified person before being used in a clinical or billing decision.

---

## Quickstart

### Option A: Transformers + PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "unsloth/Qwen2.5-3B-Instruct"
adapter = "AmareshHebbar/discharge-qa-qwen25-3b"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(model, adapter)

messages = [
    {"role": "system", "content": "You are a clinical QA assistant. Answer the question based on the discharge summary provided. Be specific and cite relevant details."},
    {"role": "user", "content": "DISCHARGE SUMMARY: [72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]\n\nQUESTION: What medications was the patient discharged on?"},
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

**Example output** (sampling is enabled, so the exact wording may vary):

```
The patient was discharged on three medications: 1) Furosemide 80mg once daily, 2) Carvedilol 12.5mg twice daily, 3) Sacubitril/Valsartan 24/26mg twice daily.
```

### Option B: Unsloth (2x faster load + inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="AmareshHebbar/discharge-qa-qwen25-3b",
    max_seq_length=512,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# The model answers only from the supplied document, so always include the summary.
summary = "[72M, CHF admission, discharged on furosemide 80mg, carvedilol 12.5mg BD, sacubitril/valsartan]"
messages = [
    {"role": "system", "content": "You are a clinical QA assistant. Answer the question based on the discharge summary provided. Be specific and cite relevant details."},
    {"role": "user", "content": f"DISCHARGE SUMMARY: {summary}\n\nQUESTION: What was the primary admitting diagnosis?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.1, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### Option C: vLLM (production serving, OpenAI-compatible)

```bash
vllm serve unsloth/Qwen2.5-3B-Instruct \
  --enable-lora \
  --lora-modules discharge-qa-qwen25-3b=AmareshHebbar/discharge-qa-qwen25-3b \
  --host 0.0.0.0 --port 8000 --dtype bfloat16
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Placeholder: replace with the full text of the discharge summary being queried.
summary = "<paste the full discharge summary text here>"

response = client.chat.completions.create(
    model="discharge-qa-qwen25-3b",
    messages=[
        {"role": "system", "content": "You are a clinical QA assistant. Answer the question based on the discharge summary provided. Be specific and cite relevant details."},
        {"role": "user", "content": f"DISCHARGE SUMMARY: {summary}\n\nQUESTION: When is the follow-up appointment scheduled?"},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)
```

### Option D: GGUF / llama.cpp (CPU / edge inference)

This repo ships LoRA adapter weights, not a pre-merged GGUF. To run on llama.cpp, merge first:

```bash
pip install unsloth
python -c "
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained('AmareshHebbar/discharge-qa-qwen25-3b', load_in_4bit=False)
model.save_pretrained_gguf('discharge-qa-qwen25-3b-gguf', tokenizer, quantization_method='q4_k_m')
"
```

---

## Training details

### Data

The training dataset contains 30,000 examples extracted from **30k discharge summaries with structured QA pairs (AGBonnet/augmented-clinical-notes)** ([source](https://huggingface.co/datasets/AGBonnet/augmented-clinical-notes)).
No synthetic or LLM-generated training data: every example pairs real-world input with its authoritative output.

| Split | Rows |
|---|---|
| Train | 24,000 |
| Validation | 3,000 |
| Test | 3,000 |

Full extraction pipeline documented on the [dataset card](https://huggingface.co/datasets/AmareshHebbar/discharge-qa-sft).

### Hyperparameters

| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit NF4 (QLoRA) |
| Max sequence length | 512 |
| Optimizer | paged_adamw_8bit |
| Learning rate / schedule | 2e-4, cosine |
| Gradient checkpointing | Unsloth (smart offload) |

### Training compute

| | |
|---|---|
| **GPU** | NVIDIA A6000 (48GB) |
| **Cloud provider** | RunPod |
| **Training time** | ~1.5h (incl. eval + hub push) |
| **Tracking** | [W&B run](https://wandb.ai/amareshhebbar-/axiomapper/runs/PASTE_RUN_ID) |
| **CO2 estimate** | self-reported, not measured with a carbon tracker; treat as approximate |

Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) for 2x faster training and reduced VRAM, using TRL's `SFTTrainer`. Full project: [wandb.ai/amareshhebbar-/axiomapper](https://wandb.ai/amareshhebbar-/axiomapper).

---

## Bias, risks & limitations

**Data recency.** Training data reflects a specific snapshot in time (CMS FY2026 / dataset publish date). Codes, rates, and rules referenced may become outdated as source authorities issue updates. Always cross-check against the live authoritative source before high-stakes use.

**Failure mode.** Like any LLM, this model can produce a plausible-sounding but incorrect output, especially on rare, ambiguous, or highly compound real-world cases that fall outside the training distribution. It does not know when it's wrong.

**Language.** English-language input only. (The suite's Hindi-medical model is the exception: it uses Hindi system prompts, though its underlying clinical reasoning data is largely English-sourced.)

**Not a regulated medical device.** This model has not been validated, cleared, or approved by any regulatory body (FDA, CDSCO, or equivalent) as a medical device or clinical decision support tool. It is a research/engineering artifact.

**Misapplication risk.** Do not use this model as the sole basis for a clinical, billing, or compliance decision affecting a real patient or claim. Do not deploy in an emergency triage context without a human-in-the-loop and clear escalation paths.

---

## FAQ

**Q: Can I merge the adapter into the base model for faster inference?**
Yes. Use `model.merge_and_unload()` after loading with PEFT, or use Unsloth's `save_pretrained_merged()` method.

**Q: Why QLoRA instead of full fine-tuning?**
The base model already has strong language and medical knowledge from pretraining. QLoRA adapts only ~0.5-1% of parameters, which is enough to specialize the output format and domain without the cost or overfitting risk of full fine-tuning.

**Q: Can I fine-tune this further on my own data?**
Yes, this adapter can be used as a starting checkpoint for continued fine-tuning. Note that this may require merging first, depending on your training framework.

**Q: Why is the output format so strict?**
Each task was trained on a fixed system prompt and consistent output structure. Following the documented system prompt closely (see Quickstart above) gives the most reliable results; deviating from it may produce inconsistent formatting.

**Q: Does this model store or transmit my input data?**
No. Like any open-weight model, all inference happens locally on your own infrastructure (or wherever you deploy it). Nothing is sent back to the model author.

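The "~0.5-1% of parameters" figure in the FAQ and the ~60MB adapter size quoted earlier can be sanity-checked with a little arithmetic. The sketch below counts LoRA parameters for rank 16 across the seven target modules from the hyperparameter table. The layer dimensions and the ~3.09B base parameter count are assumptions about Qwen2.5-3B that are not stated in this card; verify them against the base model's `config.json` before relying on the numbers.

```python
# Assumed Qwen2.5-3B dimensions (NOT stated in this card; check config.json).
HIDDEN = 2048            # hidden_size
INTERMEDIATE = 11008     # intermediate_size (MLP width)
NUM_LAYERS = 36          # num_hidden_layers
NUM_HEADS = 16           # num_attention_heads
NUM_KV_HEADS = 2         # num_key_value_heads (grouped-query attention)
BASE_PARAMS = 3.09e9     # approximate base parameter count (assumption)

RANK = 16                # LoRA rank from the hyperparameter table
HEAD_DIM = HIDDEN // NUM_HEADS
KV_DIM = NUM_KV_HEADS * HEAD_DIM

# (in_features, out_features) of each adapted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
adapter_mb = total * 2 / 1e6      # 2 bytes per parameter if stored in 16-bit
share = total / BASE_PARAMS

print(f"{total:,} adapter parameters")   # 29,933,568 adapter parameters
print(f"{adapter_mb:.1f} MB in 16-bit")  # 59.9 MB in 16-bit
print(f"{share:.2%} of the base model")  # 0.97% of the base model
```

Under these assumptions the result lands at the upper end of the quoted range and matches the stated adapter size.
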

---

## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| `ValueError: padding_token not set` | Base tokenizer has no pad token | Set `tokenizer.pad_token = tokenizer.eos_token` before inference |
| Garbled / repeated output | Wrong chat template applied | Make sure you use `tokenizer.apply_chat_template`, not a raw string prompt |
| CUDA OOM on load | Insufficient VRAM | Use `load_in_4bit=True` (as in Option B above) or reduce `max_seq_length` |
| Adapter loads but ignores fine-tuning | Base model mismatch | Confirm you loaded the **exact** base listed above; adapters are not portable across different base models or quantizations |

---

## Related models in this suite

| Model | Task | Size |
|---|---|---|
| [icd10-coder-qwen25-7b](https://huggingface.co/AmareshHebbar/icd10-coder-qwen25-7b) | ICD-10-CM medical coding | 7B |
| [snomed-mapper-qwen25-7b](https://huggingface.co/AmareshHebbar/snomed-mapper-qwen25-7b) | Clinical concept mapping | 7B |
| [icd10-to-drg-qwen25-1b](https://huggingface.co/AmareshHebbar/icd10-to-drg-qwen25-1b) | ICD-10 to DRG reimbursement | 1.5B |
| [pmjay-classifier-qwen25-3b](https://huggingface.co/AmareshHebbar/pmjay-classifier-qwen25-3b) | India PM-JAY classification | 3B |

**Full suite overview:** [AmareshHebbar/medical-ai-model-suite](https://huggingface.co/AmareshHebbar/medical-ai-model-suite)

---

## Changelog

| Version | Date | Notes |
|---|---|---|
| v1.0 | 2026 | Initial release: QLoRA fine-tune on 30,000 real-world rows |

---

## Citation

```bibtex
@misc{medicalai2026,
  author    = {Hebbar, Amaresh},
  title     = {Medical AI Fine-tuning Suite},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/AmareshHebbar}
}
```

## Contact

[![GitHub](https://img.shields.io/badge/GitHub-amareshhebbar-181717?logo=github)](https://github.com/amareshhebbar)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-gvamaresh-0A66C2?logo=linkedin)](https://www.linkedin.com/in/gvamaresh)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Profile-AmareshHebbar-FFD21E)](https://huggingface.co/AmareshHebbar)