---
license: cc-by-nc-nd-4.0
language:
- en
library_name: transformers
base_model: Qwen/Qwen3-8B
tags:
- medical
- clinical-reasoning
- fine-tuned
- qlora
- healthcare
- benchmarking
- medxpertqa
- pentabrid
datasets:
- medqa
- pubmedqa
- medmcqa
pipeline_tag: text-generation
model-index:
- name: Diagnostic-Reasoning-Q3X1
  results:
  - task:
      type: text-generation
      name: Medical Reasoning
    dataset:
      name: MedXpertQA Text
      type: custom
    metrics:
    - type: accuracy
      value: 23.8
      name: Accuracy
  - task:
      type: text-generation
      name: Medical Knowledge
    dataset:
      name: MedQA (USMLE)
      type: medqa
    metrics:
    - type: accuracy
      value: 72.7
      name: Accuracy
  - task:
      type: text-generation
      name: Medical Knowledge
    dataset:
      name: MMLU Professional Medicine
      type: mmlu
    metrics:
    - type: accuracy
      value: 88.2
      name: Accuracy
  - task:
      type: text-generation
      name: Medical Knowledge
    dataset:
      name: MMLU Medical Genetics
      type: mmlu
    metrics:
    - type: accuracy
      value: 91.0
      name: Accuracy
  - task:
      type: text-generation
      name: Medical Knowledge
    dataset:
      name: MMLU Clinical Knowledge
      type: mmlu
    metrics:
    - type: accuracy
      value: 87.9
      name: Accuracy
  - task:
      type: text-generation
      name: Medical Safety
    dataset:
      name: MedSafetyBench
      type: custom
    metrics:
    - type: accuracy
      value: 98.3
      name: Refusal Rate
---

# 🧠 Diagnostic-Reasoning-Q3

> **The highest-performing open-source sub-10B clinical reasoning model on MedXpertQA**

🩺 8B parameters · ⚡ ~$300 training cost · 🛡️ 98.3% safety · 🏥 Runs on a single consumer GPU

Diagnostic-Reasoning-Q3X1 (Q3) is an 8-billion parameter clinical reasoning model built on the Qwen3-8B base using the **Pentabrid** training framework. It achieves competitive performance with frontier models 9–84× larger on the most challenging medical reasoning benchmark available.

> 📄 **Paper:** *Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI*
>
> 👨‍⚕️ **Authors:** Adnan Agha, Eram Anwar (College of Medicine and Health Sciences, UAE University)
>
> 📹 **Live Evaluation:** [Watch full 10,578-question evaluation session](https://asciinema.org/a/822289)

---

## Key Results

| Metric | Value |
|--------|-------|
| MedXpertQA Text (expert reasoning) | **23.8%** (584/2450) |
| MedQA (USMLE) | **72.7%** (926/1273) |
| MMLU Medical Genetics | **91.0%** (91/100) |
| MMLU Professional Medicine | **88.2%** (240/272) |
| MMLU Clinical Knowledge | **87.9%** (233/265) |
| MMLU Anatomy | **79.3%** (107/135) |
| PubMedQA | **75.2%** (752/1000) |
| MedMCQA | **60.5%** (2531/4183) |
| MedSafetyBench (900 items) | **98.3%** refusal rate |
| Parameter Efficiency Ratio | **2.98** accuracy % per billion parameters |
| Training Cost | **~$300 USD** |

---

## MedXpertQA Leaderboard Position

Q3 ranks alongside models 9–84× larger on the official MedXpertQA Text evaluation:

| Rank | Model | Parameters | Accuracy | Type |
|------|-------|-----------|----------|------|
| 1 | o1 | Proprietary | 44.7% | Inference-time scaled |
| 2 | DeepSeek-R1 | 671B | 37.8% | Inference-time scaled |
| 3 | o3-mini | Proprietary | 37.3% | Inference-time scaled |
| 4 | GPT-4o | ~200B† | 30.4% | Vanilla |
| 5 | LLaMA-3.3-70B | 70B | 24.5% | Vanilla |
| 6 | DeepSeek-V3 | 685B (37B active) | 24.2% | Vanilla |
| **7** | **Q3 (Ours)** | **8B** | **23.8%** | **Training-optimised** |
| 8 | Claude-3.5 Sonnet | ~175B† | 21.3% | Vanilla |
| 9 | Gemini-2.0 Flash | MoE† | 20.6% | Vanilla |
| 10 | Qwen2.5-72B | 72B | 18.9% | Vanilla |
| 11 | QwQ-32B-Preview | 32B | 18.0% | Inference-time scaled |

*†Estimated parameter counts. Comparator scores from [Zuo et al. (ICML 2025)](https://arxiv.org/abs/2501.18362).
Q3 was evaluated using the identical official generative methodology.*

**No other sub-10B model approaches this performance tier.**

---

## Evaluation-Format Mismatch Penalty (Dual-Scoring Analysis)

Q3 was evaluated under both generative chain-of-thought and zero-shot log-likelihood (LL) scoring to investigate format sensitivity:

| Benchmark | Q3 Generative | Q3 Log-Likelihood | Qwen3-8B Base LL | Gen–LL Gap | Q3–Base LL Gap |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Medical Genetics | 91.0% | 87.0% | 82.0% | +4.0pp | +5.0pp |
| Professional Medicine | 88.2% | 89.0% | 81.6% | −0.7pp | +7.4pp |
| Clinical Knowledge | 87.9% | 87.2% | 79.2% | +0.8pp | +7.9pp |
| Anatomy | 79.3% | 82.2% | 71.1% | −3.0pp | +11.1pp |
| MedQA (USMLE) | 72.7% | 65.9% | 64.2% | **+6.8pp** | +1.7pp |
| MedMCQA | 60.5% | 58.5% | 59.8% | +2.0pp | −1.3pp |

On four of the six benchmarks, generative accuracy exceeded log-likelihood accuracy. The largest gap (+6.8pp) occurred on MedQA, the most reasoning-intensive standard benchmark.

---

## Body-System Performance (MedXpertQA Text, n=2450)

| Body System | Correct | Total | Accuracy |
|-------------|:---:|:---:|:---:|
| Nervous | 110 | 386 | 28.5% |
| Integumentary | 13 | 48 | 27.1% |
| Lymphatic | 22 | 85 | 25.9% |
| Endocrine | 43 | 176 | 24.4% |
| Reproductive | 49 | 201 | 24.4% |
| Cardiovascular | 73 | 306 | 23.9% |
| Other / NA | 30 | 126 | 23.8% |
| Urinary | 29 | 123 | 23.6% |
| Skeletal | 82 | 355 | 23.1% |
| Respiratory | 44 | 193 | 22.8% |
| Digestive | 56 | 274 | 20.4% |
| Muscular | 33 | 177 | 18.6% |
| **Overall** | **584** | **2450** | **23.8%** |

Range: 18.6–28.5% (9.9pp spread across 12 systems), which suggests balanced reasoning acquisition rather than domain-specific memorisation.

---

## Live Evaluation Recording

The complete evaluation session (all benchmarks, including MedXpertQA with 2450 questions, MedSafetyBench with 900 items, and seven standard benchmarks) was recorded end-to-end in a single uninterrupted session.

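
The session size can be cross-checked against the per-benchmark counts reported in this card. The sketch below is illustrative only: `QUESTION_COUNTS`, `total_questions`, and `rate_percent` are hypothetical names, and the numbers are copied by hand from the tables in this card rather than read from any evaluation log.

```python
# Question counts copied from the tables in this card (not from evaluation logs).
QUESTION_COUNTS = {
    "MedXpertQA Text": 2450,
    "MedSafetyBench": 900,
    "MedQA (USMLE)": 1273,
    "MMLU Medical Genetics": 100,
    "MMLU Professional Medicine": 272,
    "MMLU Clinical Knowledge": 265,
    "MMLU Anatomy": 135,
    "PubMedQA": 1000,
    "MedMCQA": 4183,
}


def total_questions(counts):
    """Sum the number of questions across all benchmarks."""
    return sum(counts.values())


def rate_percent(numerator, denominator):
    """Return numerator / denominator as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


if __name__ == "__main__":
    total = total_questions(QUESTION_COUNTS)
    print(f"{len(QUESTION_COUNTS)} benchmarks, {total} questions in total")
    # 46 is the number of responses with no extractable answer reported in this card.
    print(f"No-extract rate: {rate_percent(46, total)}%")
```

The counts sum to the 10,578 questions quoted for the recording, so the headline figure is consistent with the individual tables.
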

### ▶️ [Watch the full evaluation recording on asciinema](https://asciinema.org/a/822289)

[![asciicast](https://asciinema.org/a/822289.svg)](https://asciinema.org/a/822289)

Total evaluation: 10,578 questions across 9 benchmarks. Single NVIDIA H100 80GB GPU. ~30 minutes total inference time. No-extract rate: 0.43% (46/10,578).

---

## The Pentabrid Framework

Q3 was trained using the **Pentabrid five-phase self-correcting reasoning protocol**, which embeds structured clinical reasoning directly into model weights:

1. **Read All Options First**: Prevents anchoring bias by requiring systematic review before evaluation
2. **Read the Question**: Systematic extraction of clinical features, demographics, and key findings
3. **Evaluate Each Option**: Explicit RIGHT/WRONG determination with mechanistic reasoning for every choice
4. **Self-Correction Check**: Structured cognitive debiasing audit targeting anchoring, premature closure, availability heuristic, and search satisficing
5. **Final Selection**: Deterministic answer extraction following the complete reasoning chain

This protocol mirrors how expert clinicians transition from Type 1 (pattern recognition) to Type 2 (analytical, probabilistic) reasoning when confronted with ambiguous clinical presentations.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen3-8B |
| Method | QLoRA (4-bit base, BF16 adapters) |
| LoRA rank / alpha | 128 / 256 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 8192 tokens |
| Effective batch size | 32 (2 × 16 grad accumulation) |
| Learning rate | 2×10⁻⁴ cosine, 5% warmup |
| Stabiliser epoch (learning rate) | 5×10⁻⁵ |
| Training data | ~75,000 effective examples |
| Training time | 12–17 hours |
| Hardware | Single NVIDIA H100 80GB |
| Cost | ~$300 USD |

Full methodology details are protected under institutional intellectual property (UAEU reference IDF-00388).

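
The batch-size and learning-rate rows above can be made concrete with a few lines of arithmetic. This is a minimal sketch, not the authors' training code: it assumes linear warmup followed by cosine decay to zero, which is a common way such schedules are implemented, and `TOTAL_STEPS` is a hypothetical value that roughly matches one pass over ~75,000 examples at 32 examples per step. The separate stabiliser epoch at 5×10⁻⁵ is not modelled.

```python
import math

PEAK_LR = 2e-4          # from the Training Details table
WARMUP_FRACTION = 0.05  # 5% warmup, from the table
PER_DEVICE_BATCH = 2    # from the table
GRAD_ACCUM_STEPS = 16   # from the table
TOTAL_STEPS = 2344      # hypothetical: ~75,000 examples / 32 per step, one pass


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def cosine_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
              warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("effective batch:", effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))
    for step in (0, 50, 117, 1000, 2000, TOTAL_STEPS):
        print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate climbs linearly for the first 117 steps, peaks at 2×10⁻⁴, and then decays smoothly for the rest of the run.
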

---

## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"

# Load the tokenizer and the model weights; device_map="auto" places the
# weights on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = """You are an expert clinical reasoning assistant. Answer the following medical question using the five-phase reasoning protocol.

Question: A 45-year-old woman presents with fatigue, weight gain, and cold intolerance. TSH is 12 mIU/L. What is the most likely diagnosis?

A) Graves' disease
B) Primary hypothyroidism
C) Secondary hypothyroidism
D) Euthyroid sick syndrome
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding for deterministic output. A temperature of 0 is not a valid
# sampling setting in transformers, so sampling is switched off instead.
outputs = model.generate(**inputs, max_new_tokens=3000, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Hardware Requirements

| Precision | Minimum VRAM |
|-----------|:---:|
| BF16 (full) | ~18 GB |
| GPTQ 4-bit | ~6 GB |
| GGUF Q4_K_M | ~6 GB |

Runs on a single consumer GPU (RTX 3090/4090, A6000, or equivalent). No cloud API is required.

---

## Safety

Q3 achieved a **98.3% refusal rate** (885/900) on MedSafetyBench, exceeding the unmodified Qwen3-8B base model (95.9%). Clinical reasoning optimisation improved safety rather than compromising it.

⚠️ **This model is for research purposes only.** It has not been validated for clinical use and should not be used as a substitute for professional medical advice, diagnosis, or treatment.


---

## Citation

```bibtex
@article{agha2026training,
  title={Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI},
  author={Agha, Adnan and Anwar, Eram},
  year={2026},
  institution={College of Medicine and Health Sciences, United Arab Emirates University}
}
```

---

## Links

- **Model weights:** [Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1](https://huggingface.co/Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1)
- **Evaluation recording:** [asciinema.org/a/822289](https://asciinema.org/a/822289)
- **MedXpertQA benchmark:** [Zuo et al. (ICML 2025)](https://arxiv.org/abs/2501.18362)
- **Organisation:** [Clinical-Reasoning-Hub on HuggingFace](https://huggingface.co/Clinical-Reasoning-Hub)

---

## Licence

CC-BY-NC-ND-4.0

---

*Developed by Dr Adnan Agha, Department of Internal Medicine, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, UAE.*