---
license: cc-by-nc-nd-4.0
language:
  - en
library_name: transformers
base_model: Qwen/Qwen3-8B
tags:
  - medical
  - clinical-reasoning
  - fine-tuned
  - qlora
  - healthcare
  - benchmarking
  - medxpertqa
  - pentabrid
datasets:
  - medqa
  - pubmedqa
  - medmcqa
pipeline_tag: text-generation
model-index:
  - name: Diagnostic-Reasoning-Q3X1
    results:
      - task:
          type: text-generation
          name: Medical Reasoning
        dataset:
          name: MedXpertQA Text
          type: custom
        metrics:
          - type: accuracy
            value: 23.8
            name: Accuracy
      - task:
          type: text-generation
          name: Medical Knowledge
        dataset:
          name: MedQA (USMLE)
          type: medqa
        metrics:
          - type: accuracy
            value: 72.7
            name: Accuracy
      - task:
          type: text-generation
          name: Medical Knowledge
        dataset:
          name: MMLU Professional Medicine
          type: mmlu
        metrics:
          - type: accuracy
            value: 88.2
            name: Accuracy
      - task:
          type: text-generation
          name: Medical Knowledge
        dataset:
          name: MMLU Medical Genetics
          type: mmlu
        metrics:
          - type: accuracy
            value: 91.0
            name: Accuracy
      - task:
          type: text-generation
          name: Medical Knowledge
        dataset:
          name: MMLU Clinical Knowledge
          type: mmlu
        metrics:
          - type: accuracy
            value: 87.9
            name: Accuracy
      - task:
          type: text-generation
          name: Medical Safety
        dataset:
          name: MedSafetyBench
          type: custom
        metrics:
          - type: accuracy
            value: 98.3
            name: Refusal Rate
---

# 🧠 Diagnostic-Reasoning-Q3

> **The highest-performing open-source sub-10B clinical reasoning model on MedXpertQA**

🩺 8B parameters · ⚡ ~$300 training cost · 🛡️ 98.3% safety · 🏥 Runs on a single consumer GPU

Diagnostic-Reasoning-Q3X1 (Q3) is an 8-billion-parameter clinical reasoning model built on the Qwen3-8B base using the **Pentabrid** training framework. On MedXpertQA Text, an expert-level medical reasoning benchmark, it performs competitively with frontier models 9–84× larger.

> 📄 **Paper:** *Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI*
> 👨‍⚕️ **Authors:** Adnan Agha, Eram Anwar — College of Medicine and Health Sciences, UAE University
> 📹 **Live Evaluation:** [Watch full 10,578-question evaluation session](https://asciinema.org/a/822289)

---

## Key Results

| Metric | Value |
|--------|-------|
| MedXpertQA Text (expert reasoning) | **23.8%** (584/2450) |
| MedQA — USMLE | **72.7%** (926/1273) |
| MMLU Medical Genetics | **91.0%** (91/100) |
| MMLU Professional Medicine | **88.2%** (240/272) |
| MMLU Clinical Knowledge | **87.9%** (233/265) |
| MMLU Anatomy | **79.3%** (107/135) |
| PubMedQA | **75.2%** (752/1000) |
| MedMCQA | **60.5%** (2531/4183) |
| MedSafetyBench (900 items) | **98.3%** refusal rate |
| Parameter Efficiency Ratio | **2.98** accuracy points (%) per billion parameters |
| Training Cost | **~$300 USD** |
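
This card does not define the Parameter Efficiency Ratio. The reported 2.98 is consistent with dividing MedXpertQA Text accuracy (in percent) by the parameter count in billions. A minimal sketch under that assumed definition follows; the function name is hypothetical.

```python
def parameter_efficiency_ratio(correct: int, total: int, params_billions: float) -> float:
    """Accuracy in percent divided by parameter count in billions (assumed definition)."""
    accuracy_pct = 100.0 * correct / total
    return accuracy_pct / params_billions


# MedXpertQA Text counts from the table above; 8B is the nominal model size.
per = parameter_efficiency_ratio(correct=584, total=2450, params_billions=8.0)
print(f"{per:.2f}")  # prints 2.98
```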

---

## MedXpertQA Leaderboard Position

Q3 ranks alongside models 9–84× larger on the official MedXpertQA Text evaluation:

| Rank | Model | Parameters | Accuracy | Type |
|------|-------|-----------|----------|------|
| 1 | o1 | Proprietary | 44.7% | Inference-time scaled |
| 2 | DeepSeek-R1 | 671B | 37.8% | Inference-time scaled |
| 3 | o3-mini | Proprietary | 37.3% | Inference-time scaled |
| 4 | GPT-4o | ~200B† | 30.4% | Vanilla |
| 5 | LLaMA-3.3-70B | 70B | 24.5% | Vanilla |
| 6 | DeepSeek-V3 | 685B (37B active) | 24.2% | Vanilla |
| **7** | **Q3 (Ours)** | **8B** | **23.8%** | **Training-optimised** |
| 8 | Claude-3.5 Sonnet | ~175B† | 21.3% | Vanilla |
| 9 | Gemini-2.0 Flash | MoE† | 20.6% | Vanilla |
| 10 | Qwen2.5-72B | 72B | 18.9% | Vanilla |
| 11 | QwQ-32B-Preview | 32B | 18.0% | Inference-time scaled |

*†Estimated parameter counts. Comparator scores are from [Zuo et al. (ICML 2025)](https://arxiv.org/abs/2501.18362). Q3 was evaluated using the same official generative methodology.*

**No other sub-10B model approaches this performance tier.**

---

## Evaluation-Format Mismatch Penalty (Dual-Scoring Analysis)

Q3 was evaluated under both generative chain-of-thought and zero-shot log-likelihood scoring to investigate format sensitivity:

| Benchmark | Q3 Generative | Q3 Log-Likelihood | Qwen3-8B Base LL | Gen–LL Gap | Q3–Base LL Gap |
|-----------|:---:|:---:|:---:|:---:|:---:|
| Medical Genetics | 91.0% | 87.0% | 82.0% | +4.0pp | +5.0pp |
| Professional Medicine | 88.2% | 89.0% | 81.6% | −0.7pp | +7.4pp |
| Clinical Knowledge | 87.9% | 87.2% | 79.2% | +0.8pp | +7.9pp |
| Anatomy | 79.3% | 82.2% | 71.1% | −3.0pp | +11.1pp |
| MedQA — USMLE | 72.7% | 65.9% | 64.2% | **+6.8pp** | +1.7pp |
| MedMCQA | 60.5% | 58.5% | 59.8% | +2.0pp | −1.3pp |

On 4/6 benchmarks, generative accuracy exceeded log-likelihood accuracy. The largest gap (+6.8pp) occurred on MedQA, the most reasoning-intensive standard benchmark.
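
The two scoring modes differ in what counts as the model's answer. The sketch below illustrates the general idea only and is not the evaluation harness used for these results. The function names, the answer-extraction regex, and the A–J letter range are all assumptions.

```python
import re
from typing import Optional


def score_loglik(option_logprobs: dict[str, float]) -> str:
    """Log-likelihood scoring: choose the option with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def score_generative(completion: str) -> Optional[str]:
    """Generative scoring: take the last 'answer: X' style letter in free text.

    Returns None when no letter can be extracted (a "no-extract" case).
    """
    # 'answer' and 'is' match case-insensitively; the option letter must be upper case.
    pattern = r"(?i:answer)\s*(?:(?i:is)\s*)?[:\-]?\s*\(?([A-J])\)?(?![A-Za-z])"
    matches = re.findall(pattern, completion)
    return matches[-1] if matches else None


toy_logprobs = {"A": -3.2, "B": -0.4, "C": -2.7, "D": -4.1}  # made-up values
print(score_loglik(toy_logprobs))                                   # B
print(score_generative("Option A is wrong ... Final answer: B"))    # B
print(score_generative("I cannot determine this."))                 # None
```

Log-likelihood scoring never sees the reasoning chain, so a model trained to reason before answering can score differently under the two modes.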

---

## Body-System Performance (MedXpertQA Text, n=2450)

| Body System | Correct | Total | Accuracy |
|-------------|:---:|:---:|:---:|
| Nervous | 110 | 386 | 28.5% |
| Integumentary | 13 | 48 | 27.1% |
| Lymphatic | 22 | 85 | 25.9% |
| Endocrine | 43 | 176 | 24.4% |
| Reproductive | 49 | 201 | 24.4% |
| Cardiovascular | 73 | 306 | 23.9% |
| Other / NA | 30 | 126 | 23.8% |
| Urinary | 29 | 123 | 23.6% |
| Skeletal | 82 | 355 | 23.1% |
| Respiratory | 44 | 193 | 22.8% |
| Digestive | 56 | 274 | 20.4% |
| Muscular | 33 | 177 | 18.6% |
| **Overall** | **584** | **2450** | **23.8%** |

Range: 18.6–28.5% (a 9.9pp spread across 12 body-system categories), which suggests broadly balanced reasoning acquisition rather than domain-specific memorisation.
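
The per-system figures can be recomputed from the counts in the table. This short check uses only numbers copied from this card:

```python
# (correct, total) per body system, copied from the table above.
BODY_SYSTEMS = {
    "Nervous": (110, 386),
    "Integumentary": (13, 48),
    "Lymphatic": (22, 85),
    "Endocrine": (43, 176),
    "Reproductive": (49, 201),
    "Cardiovascular": (73, 306),
    "Other / NA": (30, 126),
    "Urinary": (29, 123),
    "Skeletal": (82, 355),
    "Respiratory": (44, 193),
    "Digestive": (56, 274),
    "Muscular": (33, 177),
}

accuracy = {name: 100.0 * c / t for name, (c, t) in BODY_SYSTEMS.items()}
best = max(accuracy, key=accuracy.get)
worst = min(accuracy, key=accuracy.get)
spread_pp = accuracy[best] - accuracy[worst]

total_correct = sum(c for c, _ in BODY_SYSTEMS.values())
total_items = sum(t for _, t in BODY_SYSTEMS.values())

print(f"{best}: {accuracy[best]:.1f}%, {worst}: {accuracy[worst]:.1f}%")
print(f"spread: {spread_pp:.1f}pp over {len(BODY_SYSTEMS)} categories")
print(f"overall: {total_correct}/{total_items} = {100.0 * total_correct / total_items:.1f}%")
```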

---

## Live Evaluation Recording

The complete evaluation session — all benchmarks including MedXpertQA (2450 questions), MedSafetyBench (900 items), and seven standard benchmarks — was recorded end-to-end in a single uninterrupted session.

### ▶️ [Watch the full evaluation recording on asciinema](https://asciinema.org/a/822289)

[![asciicast](https://asciinema.org/a/822289.svg)](https://asciinema.org/a/822289)

Total evaluation: 10,578 questions across 9 benchmarks. Single NVIDIA H100 80GB GPU. ~30 minutes total inference time. No-extract rate: 0.43% (46/10,578).
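
The question total and the no-extract rate follow from the benchmark sizes reported elsewhere in this card, as this quick check shows:

```python
# Item counts per benchmark, taken from the tables in this card.
BENCHMARK_SIZES = {
    "MedXpertQA Text": 2450,
    "MedSafetyBench": 900,
    "MedQA (USMLE)": 1273,
    "MMLU Medical Genetics": 100,
    "MMLU Professional Medicine": 272,
    "MMLU Clinical Knowledge": 265,
    "MMLU Anatomy": 135,
    "PubMedQA": 1000,
    "MedMCQA": 4183,
}
NO_EXTRACT = 46  # responses with no extractable answer, as reported above

total = sum(BENCHMARK_SIZES.values())
no_extract_rate = 100.0 * NO_EXTRACT / total

print(f"{len(BENCHMARK_SIZES)} benchmarks, {total:,} items")  # 9 benchmarks, 10,578 items
print(f"no-extract rate: {no_extract_rate:.2f}%")             # no-extract rate: 0.43%
```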

---

## The Pentabrid Framework

Q3 was trained using the **Pentabrid five-phase self-correcting reasoning protocol**, which embeds structured clinical reasoning directly into model weights:

1. **Read All Options First** — Prevents anchoring bias by requiring systematic review before evaluation
2. **Read the Question** — Systematic extraction of clinical features, demographics, and key findings
3. **Evaluate Each Option** — Explicit RIGHT/WRONG determination with mechanistic reasoning for every choice
4. **Self-Correction Check** — Structured cognitive debiasing audit targeting anchoring, premature closure, availability heuristic, and search satisficing
5. **Final Selection** — Deterministic answer extraction following the complete reasoning chain

This protocol mirrors how expert clinicians transition from Type 1 (pattern recognition) to Type 2 (analytical, probabilistic) reasoning when confronted with ambiguous clinical presentations.

---

## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | Qwen3-8B |
| Method | QLoRA (4-bit base, BF16 adapters) |
| LoRA rank / alpha | 128 / 256 |
| Target modules | q, k, v, o, gate, up, down proj |
| Max sequence length | 8192 tokens |
| Effective batch size | 32 (2 × 16 grad accumulation) |
| Learning rate | 2×10⁻⁴ cosine, 5% warmup |
| Stabiliser epoch learning rate | 5×10⁻⁵ |
| Training data | ~75,000 effective examples |
| Training time | 12–17 hours |
| Hardware | Single NVIDIA H100 80GB |
| Cost | ~$300 USD |

Full methodology details are protected under institutional intellectual property (UAEU reference IDF-00388).
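
The published hyperparameters imply a few derived quantities. The sketch below computes them and illustrates one common reading of "cosine, 5% warmup": linear warmup followed by cosine decay to zero. The schedule shape, the per-device batch size of 2, and the single-epoch run length are assumptions, not details confirmed by this card.

```python
import math

# Values from the Training Details table.
PEAK_LR = 2e-4
WARMUP_FRACTION = 0.05
PER_DEVICE_BATCH = 2      # assumed meaning of "2 x 16 grad accumulation"
GRAD_ACCUM_STEPS = 16
LORA_RANK, LORA_ALPHA = 128, 256

EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 32
LORA_SCALING = LORA_ALPHA / LORA_RANK                  # 2.0 (standard LoRA scaling alpha / r)


def lr_at(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed schedule shape)."""
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length: one pass over ~75,000 examples at batch size 32.
total_steps = math.ceil(75_000 / EFFECTIVE_BATCH)
for step in (0, total_steps // 20, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```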

---

## Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

prompt = """You are an expert clinical reasoning assistant. Answer the following medical question using the five-phase reasoning protocol.

Question: A 45-year-old woman presents with fatigue, weight gain, and cold intolerance. TSH is 12 mIU/L. What is the most likely diagnosis?

A) Graves' disease
B) Primary hypothyroidism
C) Secondary hypothyroidism
D) Euthyroid sick syndrome
"""

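# Move the tokenised prompt to the device the model was loaded on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: use do_sample=False; temperature=0.0 is rejected when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=3000, do_sample=False)
# The decoded text contains the prompt followed by the model's reasoning and answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))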
```
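
For batch evaluation, the prompt layout in the example above can be generated programmatically. This is a sketch that assumes the same instruction line and lettered-option format. `build_prompt` is a hypothetical helper, not part of the released code.

```python
from string import ascii_uppercase

# Instruction line copied from the inference example in this card.
INSTRUCTION = (
    "You are an expert clinical reasoning assistant. Answer the following "
    "medical question using the five-phase reasoning protocol."
)


def build_prompt(question: str, options: list[str]) -> str:
    """Format a question as 'Question: ...' followed by 'A) ...', 'B) ...' lines."""
    lines = [INSTRUCTION, "", f"Question: {question}", ""]
    lines += [f"{letter}) {text}" for letter, text in zip(ascii_uppercase, options)]
    return "\n".join(lines) + "\n"


prompt = build_prompt(
    "A 45-year-old woman presents with fatigue, weight gain, and cold intolerance. "
    "TSH is 12 mIU/L. What is the most likely diagnosis?",
    ["Graves' disease", "Primary hypothyroidism",
     "Secondary hypothyroidism", "Euthyroid sick syndrome"],
)
print(prompt)
```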

### Hardware Requirements

| Precision | Minimum VRAM |
|-----------|:---:|
| BF16 (full) | ~18 GB |
| GPTQ 4-bit | ~6 GB |
| GGUF Q4_K_M | ~6 GB |

Runs on a single consumer or workstation GPU (RTX 3090/4090, RTX A6000, or equivalent). No cloud API is required.
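
As a rough cross-check of the table, memory for the weights alone can be estimated from the parameter count and the bits per weight. The sketch assumes a nominal 8 billion parameters and ignores the KV cache, activations, quantisation metadata, and runtime overhead, which presumably account for the remaining headroom in the figures above.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # nominal "8B" from this card; the exact count may differ

for name, bits in [("BF16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, bits):.1f} GB of weights")
```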

---

## Safety

Q3 achieved a **98.3% refusal rate** (885/900) on MedSafetyBench, exceeding the unmodified Qwen3-8B base model (95.9%). Clinical reasoning optimisation improved safety rather than compromising it.

⚠️ **This model is for research purposes only.** It has not been validated for clinical use and should not be used as a substitute for professional medical advice, diagnosis, or treatment.

---

## Citation

```bibtex
@article{agha2026training,
  title={Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI},
  author={Agha, Adnan and Anwar, Eram},
  year={2026},
  institution={College of Medicine and Health Sciences, United Arab Emirates University}
}
```

---

## Links

- **Model weights:** [Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1](https://huggingface.co/Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1)
- **Evaluation recording:** [asciinema.org/a/822289](https://asciinema.org/a/822289)
- **MedXpertQA benchmark:** [Zuo et al. (ICML 2025)](https://arxiv.org/abs/2501.18362)
- **Organisation:** [Clinical-Reasoning-Hub on HuggingFace](https://huggingface.co/Clinical-Reasoning-Hub)

---

## Licence

CC-BY-NC-ND-4.0

---

*Developed by Dr Adnan Agha, Department of Internal Medicine, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, UAE.*