ZeroTime-Bot: Medical Triage Alignment with GRPO and Unsloth
From "Over-Triage" to Clinical Accuracy
The Challenge: Taming AI's Safety Bias in Medical Triage
In critical sectors like healthcare, AI's inherent "safety-first" bias often leads to over-triage. A model, eager not to miss a critical condition, might classify even a minor injury (like a stubbed toe) as an "Emergency" (Level 1). While seemingly benign, this "triage inflation" can dangerously overburden emergency rooms, diverting resources from truly critical patients.
Our goal with ZeroTime-Bot was to create an AI medical triage agent that:
- Maintains patient safety for genuine emergencies.
- Accurately de-escalates minor injuries to appropriate non-urgent levels.
- Provides transparent reasoning for its decisions.
The Solution: Unleashing GRPO with Unsloth
To tackle this challenge, we leveraged a powerful combination of technologies:
- Unsloth: For efficient and significantly faster fine-tuning of Large Language Models (LLMs). Unsloth's optimizations were critical in iterating quickly on our RL experiments within Google Colab's High-RAM environment.
- GRPO (Group Relative Policy Optimization): A Reinforcement Learning (RL) algorithm designed for LLM alignment. Unlike traditional Supervised Fine-Tuning (SFT), which teaches the model to imitate reference answers, GRPO lets the model sample its own responses and learn from a reward-based feedback loop.
Why GRPO for Medical Triage?
GRPO offers several key advantages for this specific task:
- "Critic-Less" RL: It sidesteps the need for a separate Value (Critic) model, which drastically reduces VRAM requirements and training time. This allowed us to align a robust Llama-3.1 8B model effectively.
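- Comparative Learning: For each medical scenario, GRPO generates a "group" of possible reasoning paths and triage decisions. It scores each response with a human-defined reward function and then compares the scores within the group, so responses that beat the group average are reinforced and those below it are discouraged.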
- Targeted Alignment: This comparative approach allowed us to specifically penalize "over-triage" while rewarding clinically sound and safe decisions.
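The group comparison described above can be sketched in a few lines. This is a simplified illustration, not the training code used for ZeroTime-Bot: the function name, the `eps` value, and the sample rewards are assumptions, and real implementations differ in details such as which standard deviation they use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its own group.

    A positive value means the completion scored above the group average,
    a negative value means it scored below. No critic model is involved.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled completions of one Level 3 prompt:
# two correct answers (2.0), one over-triage (-1.0), one other miss (0.0).
group = [2.0, -1.0, 2.0, 0.0]
advantages = group_relative_advantages(group)
```

Because the baseline is the group's own mean, the over-triaged completion ends up with a negative advantage while the two correct ones are pushed up, which is what lets GRPO skip the separate value model.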
Our Training Journey: Overcoming Colab's Challenges
The path to ZeroTime-Bot wasn't without its hurdles:
- Colab Runtime Instability: Initial attempts faced frequent NameError and ModuleNotFoundError exceptions due to Colab runtime resets and library version mismatches.
- High-RAM Requirement: The 16-bit merge phase of the 8B Llama model required a dedicated High-RAM Colab Pro instance to prevent memory crashes.
- GGUF Export Issues: Ensuring the final 5GB .gguf file correctly synced to Google Drive for Hugging Face upload required specific shutil.copy commands to bypass Colab's syncing latency.
- Ollama Integration: Getting the model running locally on Mac via Ollama involved precise Modelfile creation and troubleshooting "invalid model name" errors, reinforcing the importance of clean terminal commands.
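As a rough illustration of the GGUF export step, the sketch below copies a file with shutil.copy and then checks that the destination has the same size as the source. The helper name and the example paths are hypothetical; the exact commands used in the original notebook are not shown in this post.

```python
import shutil
from pathlib import Path


def copy_and_verify(src, dst_dir):
    """Copy src into dst_dir and confirm the byte counts match."""
    src = Path(src)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy returns the path of the newly created file.
    dst = Path(shutil.copy(src, dst_dir))
    if dst.stat().st_size != src.stat().st_size:
        raise IOError(f"size mismatch after copying {src.name}")
    return dst


# Hypothetical usage in Colab, after mounting Google Drive:
# copy_and_verify("/content/medical_triage_q4_k_m.gguf",
#                 "/content/drive/MyDrive/zerotime-bot")
```

Checking the size right after the copy gives an early signal that the file landed on the mounted drive in full before starting a multi-gigabyte upload.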
The Heart of Alignment: Our Reward Function
The core of GRPO's success lay in our custom reward functions, especially the correctness_reward_func. This function wasn't just about matching an exact answer; it was about shaping the model's judgment:

    import re

    ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

    def correctness_reward_func(prompts, completions, answer, **kwargs):
        """
        Rewards the model for accurately assigning triage levels (1, 2, or 3),
        with a specific focus on penalizing over-triage for minor conditions.
        """
        responses = [completion[0]['content'] for completion in completions]
        rewards = []
        for response, ans in zip(responses, answer):
            match = ANSWER_RE.search(response)
            ext = match.group(1).strip() if match else ""
            # High reward for correct match, especially for de-escalation
            if ext == str(ans).strip():
                rewards.append(2.0)
            # No reward when the answer is missing or not a number
            # (int() would raise ValueError on such input)
            elif not ext.isdigit() or not str(ans).strip().isdigit():
                rewards.append(0.0)
            # Moderate penalty for over-triage (e.g., Level 3 becomes Level 1/2):
            # a lower number means a more urgent level
            elif int(ext) < int(ans):
                rewards.append(-1.0)
            else:  # Under-triage
                rewards.append(0.0)  # Or a smaller penalty
        return rewards

    # An additional reward function encouraged structured <reasoning> and <answer> tags.

This reward structure directly addressed the "stubbed toe" problem, teaching the model that a stable patient with a non-critical injury should not be triaged as an emergency, even if they report pain.
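The post mentions a second reward function that encouraged structured <reasoning> and <answer> tags but does not show it. The sketch below is one plausible shape for such a function, not the original: the name format_reward_func, the 0.5 reward value, and the assumption that the reasoning block must come before the answer block are all illustrative choices.

```python
import re

# Assumed layout: a <reasoning> block followed by an <answer> block,
# with nothing but whitespace around them.
FORMAT_PATTERN = re.compile(
    r"^\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$",
    re.DOTALL,
)


def format_reward_func(completions, **kwargs):
    """Give a small reward when a completion follows the tag layout."""
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if FORMAT_PATTERN.match(r) else 0.0 for r in responses]


# Two hypothetical completions: one well-formed, one free-text.
sample = [
    [{"content": "<reasoning>Stable, walking.</reasoning>\n<answer>3</answer>"}],
    [{"content": "Probably Level 3."}],
]
scores = format_reward_func(sample)
```

Keeping the format reward smaller than the correctness reward means a well-formatted but wrong answer still scores below a correct one.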
The Proof: Evaluation Results
On the three spot-check scenarios below, our GRPO-aligned ZeroTime-Bot corrected both over-triage cases produced by the base Llama-3.1 model while keeping the emergency classification for chest pain:
| Patient Scenario | Base Llama-3.1 | ZeroTime-Bot (GRPO) | Status |
|---|---|---|---|
| Stubbed Toe (Walks, Redness) | Level 1 (Emergency) | Level 3 (Non-Urgent) | Fixed Bias |
| Chest Pain (Shortness of breath) | Level 1 (Emergency) | Level 1 (Emergency) | Kept Safety |
| Minor Fever (Stable Vitals) | Level 2 (Urgent) | Level 3 (Non-Urgent) | Fixed Bias |
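To make comparisons like the table above repeatable, the outcomes can be summarized as accuracy, over-triage rate, and under-triage rate. The sketch below is a minimal helper under stated assumptions: the function name is hypothetical, and the expected levels are inferred from the Status column (the post does not list ground-truth labels separately).

```python
def triage_metrics(predicted, expected):
    """Summarize triage predictions. Level 1 is the most urgent level."""
    n = len(expected)
    pairs = list(zip(predicted, expected))
    return {
        "accuracy": sum(p == e for p, e in pairs) / n,
        # A lower number than expected means the case was escalated.
        "over_triage": sum(p < e for p, e in pairs) / n,
        "under_triage": sum(p > e for p, e in pairs) / n,
    }


# Rows: stubbed toe, chest pain, minor fever.
expected = [3, 1, 3]  # assumed from the Status column
base = [1, 1, 2]      # Base Llama-3.1 column
grpo = [3, 1, 3]      # ZeroTime-Bot (GRPO) column

base_metrics = triage_metrics(base, expected)
grpo_metrics = triage_metrics(grpo, expected)
```

Three scenarios are a spot check rather than a benchmark; the same helper can be run over a larger labeled set to track the over-triage rate.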
Experience ZeroTime-Bot Yourself!
You can test ZeroTime-Bot locally on your own machine:
- Install Ollama: Download and install Ollama from ollama.com.
- Download Model Weights: Get the medical_triage_q4_k_m.gguf file from the "Files and versions" tab of our Hugging Face Repository.
- Prepare Modelfile: In the same directory as the GGUF, create a Modelfile with the following content (update the FROM path to your downloaded GGUF):

      FROM /Users/yourname/path/to/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf
      PARAMETER temperature 0.1
      SYSTEM """
      You are a professional medical triage officer.
      Assign a level: 1 (Emergency), 2 (Urgent), or 3 (Non-Urgent).
      Be precise and do not over-triage minor injuries.
      """

- Create and Run: Open your terminal in the same directory and execute:

      ollama create medicalbot -f Modelfile
      ollama run medicalbot

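Once the model is created, it can also be queried from a script through Ollama's local HTTP API instead of the interactive prompt. The sketch below assumes Ollama is running on its default port and that the model was named medicalbot as above. The helper names and the answer-parsing rules (an <answer> tag first, then a "Level N" phrase) are assumptions about the output format.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(symptoms, model="medicalbot"):
    """Build a non-streaming generate request for the local model."""
    return {"model": model, "prompt": symptoms, "stream": False}


def extract_level(text):
    """Return the triage level (1, 2, or 3) found in text, or None."""
    match = re.search(r"<answer>\s*([123])\s*</answer>", text)
    if match is None:
        match = re.search(r"[Ll]evel\s*([123])", text)
    return int(match.group(1)) if match else None


def triage(symptoms, model="medicalbot"):
    """Send one scenario to Ollama and parse the level from the reply."""
    data = json.dumps(build_payload(symptoms, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return extract_level(body["response"])


# Requires a running Ollama server:
# print(triage("Stubbed toe, patient walks normally, mild redness."))
```

Returning None for unparseable replies keeps a malformed response from being silently counted as a valid triage level.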
Project Links
- Hugging Face Model: hjogidasani/medical-triage-llama-3.1-8b
- GitHub Repository: DevCraft89/MedicalBot