Qwen3-4B-GRPO-TCR-Agent

This repository contains a fully fine-tuned model based on Qwen3-4B-Instruct-2507, trained in two stages:

  1. SFT (Supervised Fine-Tuning) — multi-turn agentic cold-start
  2. GRPO-TCR (Group Relative Policy Optimization with Tool Call Reward) — reinforcement learning

Training uses the Open-AgentRL framework and follows the DemyAgent methodology.

Training Objective

This model is trained to perform deliberative agentic reasoning — selectively calling a code_interpreter tool across multiple turns to solve math and coding problems, rather than relying on verbose self-reasoning or indiscriminate tool calls.

The GRPO-TCR stage reinforces:

  • Correct final answers (outcome reward)
  • Tool usage attempts even on incorrect answers (Tool Call Reward)
  • Concise responses (overlong penalty)
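The three reward terms above can be combined roughly as in the sketch below. This is a minimal illustration, not the training code: the reward magnitudes (`CORRECT_REWARD`, `WRONG_REWARD`, `TOOL_CALL_BONUS`), the `Rollout` type, and the additive combination are all assumptions made for the example, since this card does not state the exact values or formula.

```python
from dataclasses import dataclass

# Hypothetical magnitudes, chosen for illustration only (not from this card).
CORRECT_REWARD = 1.0
WRONG_REWARD = -1.0
TOOL_CALL_BONUS = 0.5


@dataclass
class Rollout:
    is_correct: bool               # final answer matched the reference
    num_tool_calls: int            # code_interpreter calls across all turns
    overlong_penalty: float = 0.0  # non-positive length penalty, computed elsewhere


def shaped_reward(rollout: Rollout) -> float:
    """Sum an outcome reward, a Tool Call Reward, and an overlong penalty."""
    reward = CORRECT_REWARD if rollout.is_correct else WRONG_REWARD
    # Tool Call Reward: attempting a tool call earns a bonus even when the
    # final answer is wrong, so tool use is not trained away early on.
    if rollout.num_tool_calls > 0:
        reward += TOOL_CALL_BONUS
    return reward + rollout.overlong_penalty
```

Under these assumed values, a wrong answer that tried the tool scores -0.5 rather than -1.0, which keeps some learning signal for tool use while the policy is still mostly wrong.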

Training Pipeline

Qwen3-4B-Instruct-2507
  │
  ├── Stage 1: SFT (multi-turn agentic cold-start)
  │     Dataset: y-ohtani/open_agentrl_like_sft (2K samples, Apache-2.0)
  │     Epochs: 10, Max length: 32768, Full fine-tuning (FSDP, bfloat16)
  │
  └── Stage 2: GRPO-TCR (this model)
        Dataset: y-ohtani/open_agentrl_grpo_2k (2K balanced sample, Apache-2.0)
        Epochs: 3, Algorithm: GRPO + 5 enhancements (see below)

GRPO-TCR Configuration

Parameter              Value
Base (SFT model)       Stage 1 SFT-merged checkpoint (epoch 3)
Algorithm              GRPO (Group Relative Policy Optimization)
Max prompt length      2,560
Max response length    10,480
Max turns              16
Learning rate          1e-6
Train batch size       4
Responses per prompt   8
Epochs                 3
Loss aggregation       token-mean
Clip ratio             low=0.2, high=0.28 (asymmetric)
KL divergence          Disabled (kl_coef=0.0)
Overlong penalty       buffer=3000, factor=1.0
Rollout engine         vLLM (sync mode)
Tool format            Hermes
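To make the overlong penalty settings concrete, the sketch below assumes a DAPO-style soft length penalty: responses are free up to `max_response_len - buffer_len` tokens, and the penalty then grows linearly until it reaches `-factor` at the maximum length. The table only gives `buffer=3000` and `factor=1.0`; the linear form and the function name `overlong_penalty` are assumptions.

```python
def overlong_penalty(
    response_len: int,
    max_response_len: int = 10_480,  # "Max response length" from the table
    buffer_len: int = 3_000,         # "buffer" from the table
    factor: float = 1.0,             # "factor" from the table
) -> float:
    """Return a non-positive penalty that grows linearly inside the buffer zone.

    Assumed form: no penalty up to (max_response_len - buffer_len) tokens,
    then a linear ramp down to -factor at max_response_len.
    """
    threshold = max_response_len - buffer_len
    excess = response_len - threshold
    if excess <= 0:
        return 0.0
    # Clamp so that lengths beyond the maximum never exceed the full penalty.
    return -min(excess / buffer_len, 1.0) * factor
```

With the values in the table, the penalty-free zone ends at 7,480 tokens, a response of 8,980 tokens gets -0.5, and a response that fills all 10,480 tokens gets -1.0.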

5 Key Enhancements over Standard GRPO

Enhancement               Purpose
Multi-turn tool calling   Enable agentic reasoning (up to 16 turns)
TCR (Tool Call Reward)    Reward tool usage even on wrong answers to prevent exploration collapse
Asymmetric clipping       Promote exploration by allowing larger probability increases
Overlong penalty          Suppress verbose responses, encourage efficient tool use
KL removal + token-mean   Allow free exploration without reference model constraint
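The interaction of group-relative advantages, asymmetric clipping, and token-mean aggregation can be sketched in plain Python as follows. This is an illustrative reading of the configuration (clip low=0.2, high=0.28, no KL term), not the framework's implementation; the function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_mean_loss(logp_new, logp_old, advantages,
                            clip_low=0.2, clip_high=0.28):
    """PPO-style clipped loss with asymmetric bounds, averaged over all tokens.

    logp_new / logp_old: one list of per-token log-probs per response.
    advantages: one scalar advantage per response, shared by its tokens.
    """
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            # Asymmetric clip: the ratio may rise to 1.28 but fall only to 0.8.
            clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-mean: divide by the total token count, so long responses
    # contribute in proportion to their length. No KL term is added.
    return total / n_tokens
```

The higher upper bound lets tokens with positive advantage gain more probability per update than a symmetric 0.2 clip would allow, which is the exploration effect the table describes.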

Dataset (RL Stage)

  • Name: y-ohtani/open_agentrl_grpo_2k
  • License: Apache-2.0 (all sources are Apache-2.0 or MIT)
  • Sampling: 2,000 samples, balanced across 5 sources (400 each)

Source          Original Dataset                          License      Domain
deepscaler      agentica-org/DeepScaleR-Preview-Dataset   MIT          Math (reasoning)
omni_math       KbsdJames/Omni-MATH                       Apache-2.0   Math (olympiad)
numina_math     AI-MO/NuminaMath-1.5                      Apache-2.0   Math (general)
taco_code       BAAI/TACO                                 Apache-2.0   Coding (algorithm)
leetcode_code   newfacade/LeetCodeDataset                 Apache-2.0   Coding (LeetCode)
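A balanced draw of 400 examples per source could be produced along the lines of the sketch below. It is not the script used to build the dataset: the `source` field name, the fixed seed, and the helper name `balanced_sample` are assumptions for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_source=400, seed=0):
    """Draw the same number of records from every source, then shuffle.

    Each record is assumed to be a dict with a "source" key (hypothetical
    field name) identifying which upstream dataset it came from.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    sample = []
    for source in sorted(by_source):  # sorted for a reproducible order
        pool = by_source[source]
        if len(pool) < per_source:
            raise ValueError(f"{source} has only {len(pool)} records")
        sample.extend(rng.sample(pool, per_source))

    rng.shuffle(sample)
    return sample
```

With five sources and `per_source=400`, this yields the 2,000-sample split described above.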

All training data is sourced from Apache-2.0 / MIT licensed open datasets. This repository does NOT redistribute the dataset.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "y-ohtani/qwen3-4b-grpo-tcr-agent"

# Load the tokenizer and the bfloat16 weights, placing layers on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a single-turn chat prompt using the model's chat template.
messages = [
    {"role": "user", "content": "Find all prime numbers p such that p^2 + 2 is also prime."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (not the prompt).
outputs = model.generate(**inputs, max_new_tokens=2048)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
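The snippet above runs a single turn without executing any tool. In a multi-turn setup, the runtime has to detect tool calls in the model output, run them, and feed the results back. The sketch below covers only the detection step, assuming the Hermes convention of a JSON object wrapped in `<tool_call>...</tool_call>` tags. The helper name `extract_tool_calls` and the `code` argument key used in the usage note are illustrative assumptions, not part of this repository.

```python
import json
import re

# Hermes-style tool calls: a JSON object between <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return the tool calls found in a model reply as a list of dicts.

    Blocks that are not valid JSON, or that lack a "name" field, are
    skipped so that a malformed call does not crash the agent loop.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls
```

For a reply containing `<tool_call>{"name": "code_interpreter", "arguments": {"code": "print(3**2 + 2)"}}</tool_call>`, the function returns one entry whose `name` is `code_interpreter`. A runtime would then execute the code in a sandbox and append the result as a tool message before generating the next turn.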

Sources & Terms

Component            Source                           License
Base model           Qwen/Qwen3-4B-Instruct-2507      Apache-2.0
SFT dataset          y-ohtani/open_agentrl_like_sft   Apache-2.0
RL dataset           y-ohtani/open_agentrl_grpo_2k    Apache-2.0
Training framework   Open-AgentRL (verl)              Apache-2.0
Methodology          DemyAgent (arXiv:2507.15997)     -

Users must comply with the base model license and dataset terms.

Intended Use & Limitations

  • Intended: Agentic reasoning tasks with tool use (math, coding). Best results when used with a code interpreter tool in multi-turn settings.
  • Not intended: Production deployment without further evaluation.
  • Limitations:
    • Trained on a 2K balanced subset (not the full 18K dataset) for efficiency.
    • Performance on non-math/non-coding tasks may degrade compared to the base instruct model.
    • Tool calling requires a compatible runtime (e.g., SandboxFusion, code interpreter).