Qwen3-4B-GRPO-TCR-Agent

This repository contains a fully fine-tuned model based on Qwen3-4B-Instruct-2507, trained in two stages:

- SFT (Supervised Fine-Tuning): multi-turn agentic cold-start
- GRPO-TCR (Group Relative Policy Optimization with Tool Call Reward): reinforcement learning

Training uses the Open-AgentRL framework and follows the DemyAgent methodology.

Training Objective

This model is trained to perform deliberative agentic reasoning — selectively calling a code_interpreter tool across multiple turns to solve math and coding problems, rather than relying on verbose self-reasoning or indiscriminate tool calls.

The GRPO-TCR stage reinforces:

- Correct final answers (outcome reward)
- Tool usage attempts even on incorrect answers (Tool Call Reward)
- Concise responses (overlong penalty)
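
To make the interaction of these three signals concrete, here is a minimal sketch of how they might combine into one scalar reward. The exact formula used in training is not stated in this card. The sketch assumes a DAPO-style linear length penalty (using the `buffer=3000`, `factor=1.0`, and 10,480-token limit from the configuration), a 0/1 outcome reward, and a Tool Call Reward applied only to wrong answers that attempted a tool call. The function names and the `tool_call_bonus` value are hypothetical.

```python
def overlong_penalty(response_len: int, max_len: int = 10480,
                     buffer: int = 3000, factor: float = 1.0) -> float:
    """Assumed DAPO-style soft penalty: zero up to (max_len - buffer),
    then decreasing linearly, reaching -factor at max_len."""
    exceed = response_len - (max_len - buffer)
    if exceed <= 0:
        return 0.0
    return -exceed / buffer * factor


def total_reward(is_correct: bool, used_tool: bool, response_len: int,
                 tool_call_bonus: float = 0.1) -> float:
    """Combine outcome reward, Tool Call Reward, and length penalty.
    tool_call_bonus is a hypothetical value, not taken from the model card."""
    outcome = 1.0 if is_correct else 0.0
    # Assumption: partial credit only when the answer is wrong but a tool was tried.
    tcr = tool_call_bonus if (used_tool and not is_correct) else 0.0
    return outcome + tcr + overlong_penalty(response_len)


if __name__ == "__main__":
    for case in [(True, True, 1000), (False, True, 1000), (True, True, 10480)]:
        print(case, total_reward(*case))
```

Under these assumptions, a wrong answer that still called the tool scores slightly above a wrong answer that did not, which is the mechanism the card describes for keeping tool exploration alive.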

Training Pipeline

Qwen3-4B-Instruct-2507
  │
  ├── Stage 1: SFT (multi-turn agentic cold-start)
  │     Dataset: y-ohtani/open_agentrl_like_sft (2K samples, Apache-2.0)
  │     Epochs: 10, Max length: 32768, Full fine-tuning (FSDP, bfloat16)
  │
  └── Stage 2: GRPO-TCR (this model)
        Dataset: y-ohtani/open_agentrl_grpo_2k (2K balanced sample, Apache-2.0)
        Epochs: 3, Algorithm: GRPO + 5 enhancements (see below)

GRPO-TCR Configuration

| Parameter | Value |
|---|---|
| Base (SFT model) | Stage 1 SFT-merged checkpoint (epoch 3) |
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Max prompt length | 2,560 |
| Max response length | 10,480 |
| Max turns | 16 |
| Learning rate | 1e-6 |
| Train batch size | 4 |
| Responses per prompt | 8 |
| Epochs | 3 |
| Loss aggregation | token-mean |
| Clip ratio | low=0.2, high=0.28 (asymmetric) |
| KL divergence | Disabled (kl_coef=0.0) |
| Overlong penalty | buffer=3000, factor=1.0 |
| Rollout engine | vLLM (sync mode) |
| Tool format | Hermes |
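
The "Responses per prompt" setting is what makes the algorithm group-relative: each prompt's 8 sampled responses are scored, and each response's advantage is its reward standardized against the group. The sketch below illustrates that step under stated assumptions. It uses the population standard deviation and a small `eps` for stability, whereas real implementations may use the sample standard deviation. The function name and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Assumption: population std plus eps; frameworks may differ in detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for 8 responses to the same prompt.
    rewards = [1.0, 0.0, 0.0, 1.0, 0.1, 0.0, 1.0, 0.0]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.1f} advantage={a:+.3f}")
```

If every response in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a partial reward for tool attempts can help on hard prompts.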

5 Key Enhancements over Standard GRPO

| Enhancement | Purpose |
|---|---|
| Multi-turn tool calling | Enable agentic reasoning (up to 16 turns) |
| TCR (Tool Call Reward) | Reward tool usage even on wrong answers to prevent exploration collapse |
| Asymmetric clipping | Promote exploration by allowing larger probability increases |
| Overlong penalty | Suppress verbose responses, encourage efficient tool use |
| KL removal + token-mean | Allow free exploration without reference model constraint |
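
As an illustration of the asymmetric clipping and token-mean rows, the following sketch computes a PPO-style clipped surrogate loss with separate lower and upper clip bounds (0.2 and 0.28, as configured) and then averages over all tokens in the batch rather than per response. It is a plain-Python sketch of the general technique, not the Open-AgentRL implementation. The function names are hypothetical, and no KL term is added, matching `kl_coef=0.0`.

```python
def clipped_token_losses(ratios, advantage, clip_low=0.2, clip_high=0.28):
    """Per-token clipped surrogate loss for one response.

    ratios: new_prob / old_prob for each generated token.
    advantage: the response-level advantage shared by all its tokens.
    """
    losses = []
    for r in ratios:
        clipped = min(max(r, 1.0 - clip_low), 1.0 + clip_high)
        # Pessimistic (minimum) objective, negated to obtain a loss.
        losses.append(-min(r * advantage, clipped * advantage))
    return losses


def token_mean(per_response_losses):
    """Average over every token in the batch, so long responses weigh more
    than they would under a per-response (sequence-mean) average."""
    flat = [loss for response in per_response_losses for loss in response]
    return sum(flat) / len(flat)


if __name__ == "__main__":
    batch = [
        clipped_token_losses([1.5, 1.1, 0.9], advantage=1.0),
        clipped_token_losses([0.5], advantage=-1.0),
    ]
    print(token_mean(batch))
```

With a positive advantage, the ratio can grow to 1.28 before the gradient is cut off, while it is clipped at 0.8 on the way down for a negative advantage. The wider upper bound is what allows larger probability increases for tokens in good responses.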

Dataset (RL Stage)

- Name: y-ohtani/open_agentrl_grpo_2k
- License: Apache-2.0 (all sources are Apache-2.0 or MIT)
- Sampling: 2,000 samples, balanced across 5 sources (400 each)

| Source | Original Dataset | License | Domain |
|---|---|---|---|
| deepscaler | agentica-org/DeepScaleR-Preview-Dataset | MIT | Math (reasoning) |
| omni_math | KbsdJames/Omni-MATH | Apache-2.0 | Math (olympiad) |
| numina_math | AI-MO/NuminaMath-1.5 | Apache-2.0 | Math (general) |
| taco_code | BAAI/TACO | Apache-2.0 | Coding (algorithm) |
| leetcode_code | newfacade/LeetCodeDataset | Apache-2.0 | Coding (LeetCode) |

All training data is sourced from Apache-2.0 / MIT licensed open datasets. This repository does NOT redistribute the dataset.
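
The balanced sampling described in this section (400 items from each of 5 sources) can be sketched as follows. This is an illustration only: the actual sampling script is not part of this card, and the `source` field name, the seed, and the function name are assumptions.

```python
import random
from collections import defaultdict


def balanced_sample(records, per_source=400, seed=0):
    """Draw the same number of records from every source, then shuffle.

    Assumes each record is a dict with a "source" key (hypothetical schema).
    Raises ValueError if a source has fewer than per_source records.
    """
    by_source = defaultdict(list)
    for record in records:
        by_source[record["source"]].append(record)

    rng = random.Random(seed)
    sample = []
    for source in sorted(by_source):  # sorted for reproducibility
        sample.extend(rng.sample(by_source[source], per_source))
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    sources = ["deepscaler", "omni_math", "numina_math", "taco_code", "leetcode_code"]
    pool = [{"source": s, "id": i} for s in sources for i in range(1000)]
    print(len(balanced_sample(pool)))
```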

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "y-ohtani/qwen3-4b-grpo-tcr-agent"

# Load the tokenizer and the model weights in bfloat16, placed automatically on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Find all prime numbers p such that p^2 + 2 is also prime."}
]
# Render the chat template to text, ending with the assistant generation prompt.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# The decoded text includes the prompt followed by the generated response.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Sources & Terms

| Component | Source | License |
|---|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 | Apache-2.0 |
| SFT dataset | y-ohtani/open_agentrl_like_sft | Apache-2.0 |
| RL dataset | y-ohtani/open_agentrl_grpo_2k | Apache-2.0 |
| Training framework | Open-AgentRL (verl) | Apache-2.0 |
| Methodology | DemyAgent (arXiv:2507.15997) | — |

Users must comply with the base model license and dataset terms.

Intended Use & Limitations

Intended: Agentic reasoning tasks with tool use (math, coding). Best results when used with a code interpreter tool in multi-turn settings.
Not intended: Production deployment without further evaluation.
Limitations:
- Trained on a 2K balanced subset (not the full 18K dataset) for efficiency.
- Performance on non-math/non-coding tasks may degrade compared to the base instruct model.
- Tool calling requires a compatible runtime (e.g., SandboxFusion, code interpreter).
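
Since the model expects a tool runtime in multi-turn settings, the following sketch shows one way a driver loop might parse Hermes-style tool calls and feed results back, capped at 16 turns as in training. It assumes the model emits calls as JSON wrapped in `<tool_call>...</tool_call>` tags. `generate` and `execute` are hypothetical callables standing in for model inference and a sandboxed code interpreter, and the demo uses scripted replies rather than a real model.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract Hermes-style tool calls (JSON objects inside <tool_call> tags)."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip malformed calls instead of failing the whole turn
        if isinstance(payload, dict):
            calls.append(payload)
    return calls


def run_agent(generate, execute, messages, max_turns=16):
    """Alternate between model replies and tool executions.

    generate(messages) -> str and execute(name, arguments) -> object are
    hypothetical callables supplied by the caller.
    """
    reply = ""
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        calls = parse_tool_calls(reply)
        if not calls:
            break  # no tool requested: treat the reply as the final answer
        for call in calls:
            output = execute(call.get("name"), call.get("arguments", {}))
            messages.append({"role": "tool", "content": str(output)})
    return reply


# Demo with stand-in callables (no model or sandbox involved).
scripted = iter([
    '<tool_call>\n{"name": "code_interpreter", "arguments": {"code": "print(2 + 2)"}}\n</tool_call>',
    "The answer is 4.",
])
history = [{"role": "user", "content": "What is 2 + 2?"}]
final = run_agent(lambda msgs: next(scripted), lambda name, args: "4", history)
print(final)
```

In a real deployment, `execute` should run code in an isolated sandbox, and `generate` would wrap the chat template and `model.generate` call shown in the Usage section.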
