ZeroTime-Bot: Medical Triage Alignment with GRPO and Unsloth

Community Article Published January 16, 2026

From "Over-Triage" to Clinical Accuracy

The Challenge: Taming AI's Safety Bias in Medical Triage

In critical sectors like healthcare, AI's inherent "safety-first" bias often leads to over-triage. A model, eager not to miss a critical condition, might classify even a minor injury (like a stubbed toe) as an "Emergency" (Level 1). While seemingly benign, this "triage inflation" can dangerously overburden emergency rooms, diverting resources from truly critical patients.

Our goal with ZeroTime-Bot was to create an AI medical triage agent that:

  1. Maintains patient safety for genuine emergencies.
  2. Accurately de-escalates minor injuries to appropriate non-urgent levels.
  3. Provides transparent reasoning for its decisions.

The Solution: Unleashing GRPO with Unsloth

To tackle this challenge, we combined two tools:

  • Unsloth: For efficient and significantly faster fine-tuning of Large Language Models (LLMs). Unsloth's optimizations were critical in iterating quickly on our RL experiments within Google Colab's High-RAM environment.
  • GRPO (Group Relative Policy Optimization): A Reinforcement Learning (RL) algorithm for LLM alignment. Unlike Supervised Fine-Tuning (SFT), which teaches the model to imitate reference outputs, GRPO samples the model's own responses and updates it according to the rewards those responses earn.

Why GRPO for Medical Triage?

GRPO offers several key advantages for this specific task:

  1. "Critic-Less" RL: It sidesteps the need for a separate Value (Critic) model, which drastically reduces VRAM requirements and training time. This allowed us to align a robust Llama-3.1 8B model effectively.
  2. Comparative Learning: For each medical scenario, GRPO generates a "group" of possible reasoning paths and triage decisions. It scores each response with a human-defined reward function, then uses each score relative to the rest of the group as the learning signal.
  3. Targeted Alignment: This comparative approach allowed us to specifically penalize "over-triage" while rewarding clinically sound and safe decisions.
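The group comparison described in the list above can be illustrated with a simplified sketch. The reward values are hypothetical, and the use of the population standard deviation and the `eps` constant are assumptions made for illustration, not details taken from our training run.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled completions of one "stubbed toe" prompt:
# one correct Level 3 answer (2.0), two over-triaged answers (-1.0), one unparsable (0.0).
group_rewards = [2.0, -1.0, -1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Because the baseline is the group's own mean, no separate critic model is needed: the correct answer receives a positive advantage and the over-triaged answers receive a negative one.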

Our Training Journey: Overcoming Colab's Challenges

The path to ZeroTime-Bot wasn't without its hurdles:

  • Colab Runtime Instability: Initial attempts faced frequent NameError and ModuleNotFoundError due to Colab runtime resets and library version mismatches.
  • High-RAM Requirement: The 16-bit merge phase of the 8B Llama model required a dedicated High-RAM Colab Pro instance to prevent memory crashes.
  • GGUF Export Issues: Ensuring the final 5GB .gguf file correctly synced to Google Drive for Hugging Face upload required specific shutil.copy commands to bypass Colab's syncing latency.
  • Ollama Integration: Getting the model running locally on Mac via Ollama involved precise Modelfile creation and troubleshooting invalid model name errors, reinforcing the importance of clean terminal commands.
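For the GGUF export step above, a copy helper along these lines is one way to confirm that the file actually landed in the mounted Drive folder. It is a minimal sketch: the function name `copy_and_verify` and the destination folder are hypothetical, and the size check is an added safeguard rather than something from our original notebook.

```python
import shutil
from pathlib import Path

def copy_and_verify(src, dst_dir):
    """Copy a file into dst_dir and confirm the copy has the same size."""
    src = Path(src)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = Path(shutil.copy(src, dst_dir))  # shutil.copy returns the new file's path
    if dst.stat().st_size != src.stat().st_size:
        raise IOError(f"Incomplete copy: {dst}")
    return dst

# Hypothetical usage in Colab after mounting Drive (folder name is an assumption):
# copy_and_verify("medical_triage_q4_k_m.gguf", "/content/drive/MyDrive/zerotime_bot")
```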

The Heart of Alignment: Our Reward Function

The core of GRPO's success lay in our custom reward functions, especially the correctness_reward_func. This function wasn't just about matching an exact answer; it was about shaping the model's judgment:

import re

def correctness_reward_func(prompts, completions, answer, **kwargs):
    """
    Rewards the model for accurately assigning triage levels (1, 2, or 3),
    with a specific focus on penalizing over-triage for minor conditions.
    """
    responses = [completion[0]["content"] for completion in completions]

    rewards = []
    for response, ans in zip(responses, answer):
        # Pull the predicted level out of the <answer> tags, if present
        match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        ext = match.group(1).strip() if match else ""
        ans = str(ans).strip()

        # High reward for a correct match, especially for de-escalation
        if ext == ans:
            rewards.append(2.0)
        # Missing or non-numeric answer: no reward (avoids int("") raising ValueError)
        elif not ext.isdigit() or not ans.isdigit():
            rewards.append(0.0)
        # Moderate penalty for over-triage (e.g., Level 3 becomes Level 1/2):
        # a lower number means a more urgent level than the reference
        elif int(ext) < int(ans):
            rewards.append(-1.0)
        else:  # Under-triage
            rewards.append(0.0)  # Or a smaller penalty
    return rewards

# An additional reward function encouraged structured <reasoning> and <answer> tags.

This reward structure directly addressed the "stubbed toe" problem, teaching the model that a stable patient with a non-critical injury should not be triaged as an emergency, even if they report pain.
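The additional format reward mentioned above might look like the following sketch. The exact tag layout, the 0.5 bonus, and the name `format_reward_func` are assumptions for illustration; the version used in training may differ.

```python
import re

# Assumed layout: a <reasoning> block followed by an <answer> block, nothing else.
FORMAT_PATTERN = re.compile(
    r"\s*<reasoning>.+?</reasoning>\s*<answer>.+?</answer>\s*", re.DOTALL
)

def format_reward_func(completions, **kwargs):
    """Give a small bonus to completions that follow the tag layout."""
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if FORMAT_PATTERN.fullmatch(r) else 0.0 for r in responses]

sample = [
    [{"content": "<reasoning>Walks unaided, mild redness.</reasoning>\n<answer>3</answer>"}],
    [{"content": "Level 1"}],
]
print(format_reward_func(sample))  # the first completion earns the bonus, the second does not
```

Keeping the format bonus smaller than the correctness reward means a well-formatted but wrong answer still scores lower than a correct one.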

The Proof: Evaluation Results

On the three test scenarios below, our GRPO-aligned ZeroTime-Bot corrected the base Llama-3.1 model's over-triage while keeping the genuine emergency at Level 1:

| Patient Scenario | Base Llama-3.1 | ZeroTime-Bot (GRPO) | Status |
|---|---|---|---|
| Stubbed Toe (Walks, Redness) | Level 1 (Emergency) | Level 3 (Non-Urgent) | ✅ Fixed Bias |
| Chest Pain (Shortness of breath) | Level 1 (Emergency) | Level 1 (Emergency) | ✅ Kept Safety |
| Minor Fever (Stable Vitals) | Level 2 (Urgent) | Level 3 (Non-Urgent) | ✅ Fixed Bias |
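The comparison above can be expressed as an over-triage rate. The sketch below uses only the three scenarios from the table. The reference levels are an assumption inferred from the Status column (a "Fixed Bias" row is taken to mean the ZeroTime-Bot output is the intended level), not labels from a separate evaluation set.

```python
# (scenario, reference level, base model level, ZeroTime-Bot level)
# Reference levels are inferred from the Status column, not from a labeled dataset.
results = [
    ("Stubbed Toe", 3, 1, 3),
    ("Chest Pain", 1, 1, 1),
    ("Minor Fever", 3, 2, 3),
]

def over_triage_rate(pairs):
    """Fraction of cases where the prediction is more urgent (lower number) than the reference."""
    pairs = list(pairs)
    return sum(pred < ref for ref, pred in pairs) / len(pairs)

base_rate = over_triage_rate((ref, base) for _, ref, base, _ in results)
bot_rate = over_triage_rate((ref, bot) for _, ref, _, bot in results)
print(f"Base over-triage: {base_rate:.2f}, ZeroTime-Bot over-triage: {bot_rate:.2f}")
```

With three cases this is an illustration of the metric rather than a statistically meaningful evaluation.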

🚀 Experience ZeroTime-Bot Yourself!

You can test ZeroTime-Bot locally on your own machine:

  1. Install Ollama: Download and install Ollama from ollama.com.
  2. Download Model Weights: Get the medical_triage_q4_k_m.gguf file from the "Files and versions" tab of our Hugging Face Repository.
  3. Prepare Modelfile: In the same directory as the GGUF, create a Modelfile with the following content (update the FROM path to your downloaded GGUF):
    FROM /Users/yourname/path/to/medical_triage_q4_k_m.gguf
    
    PARAMETER temperature 0.1
    
    SYSTEM """
    You are a professional medical triage officer. 
    Assign a level: 1 (Emergency), 2 (Urgent), or 3 (Non-Urgent).
    Be precise and do not over-triage minor injuries.
    """
    
  4. Create and Run: Open your terminal in the same directory and execute:
    ollama create medicalbot -f Modelfile
    ollama run medicalbot
    
