Text Generation
Transformers
Safetensors
qwen3
tool-use
multi-turn
agents
grpo
reinforcement-learning
tau2-bench
conversational
Eval Results (legacy)
text-generation-inference
Instructions to use Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Jarrodbarnes/Qwen3-4B-tau2-grpo-v1") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("Jarrodbarnes/Qwen3-4B-tau2-grpo-v1") model = AutoModelForCausalLM.from_pretrained("Jarrodbarnes/Qwen3-4B-tau2-grpo-v1") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
- SGLang
How to use Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 with Docker Model Runner:
docker model run hf.co/Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
File size: 5,085 Bytes
dc84867 c32c412 dc84867 9902d3e dc84867 c32c412 dc84867 9902d3e dc84867 9902d3e c32c412 dc84867 9902d3e dc84867 9902d3e dc84867 c32c412 dc84867 9902d3e c32c412 9902d3e dc84867 9902d3e c32c412 dc84867 9902d3e | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 | ---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- tool-use
- multi-turn
- agents
- grpo
- reinforcement-learning
- tau2-bench
datasets:
- Jarrodbarnes/tau2-sft-seed-v3
pipeline_tag: text-generation
model-index:
- name: Qwen3-4B-tau2-grpo-v1
results:
- task:
type: multi-turn-tool-use
name: tau2-bench
dataset:
type: tau2-bench
name: tau2-bench (test split)
metrics:
- type: pass@1
value: 36.0
name: Pass@1 (Overall)
- type: pass@4
value: 59.0
name: Pass@4 (Overall)
---
# Qwen3-4B-tau2-grpo-v1
A 4B parameter model fine-tuned for multi-turn tool-use tasks, achieving **59% Pass@4** on tau2-bench (test split). This represents a **4x improvement** over the base model and demonstrates that progressive training (SFT -> RFT -> GRPO) works effectively for complex, multi-turn agent tasks.
## Model Description
This model was trained using a three-stage pipeline:
1. **SFT (Supervised Fine-Tuning)**: Learning protocol and tool schemas from successful trajectories
2. **RFT (Rejection Fine-Tuning)**: Concentrating training on high-quality rollouts via rejection sampling
3. **GRPO (Group Relative Policy Optimization)**: Reinforcement learning with turn-level reward shaping
The training process is documented in the [tau2 training cookbook](https://github.com/THUDM/slime/blob/main/examples/tau-bench/training_cookbook.md).
## Performance
### tau2-bench Test Split (Pass@4 evaluation)
| Domain | Pass@1 | Pass@4 | Tasks |
|--------|--------|--------|-------|
| **Overall** | **36.0%** | **59.0%** | 100 |
| Airline | 15.0% | 45.0% | 20 |
| Retail | 55.0% | 85.0% | 40 |
| Telecom | 27.5% | 40.0% | 40 |
### Training Progression
| Stage | Overall Pass@4 |
|-------|----------------|
| Baseline (Qwen3-4B-Instruct) | 14.3% |
| SFT + RFT | 27.0% |
| GRPO (this model) | **59.0%** |
**Eval config**: `temperature=0.8`, `top_p=1.0`, `top_k=20`, `num_samples=4`, `TAU2_USER_MODEL=gpt-4.1-mini`, `TAU2_USER_TEMPERATURE=0.7`, `TAU2_MAX_STEPS=100`.
## Usage
### With SGLang (recommended for evaluation)
```bash
# Start the server (use --tp 1 for single GPU)
python -m sglang.launch_server \
--model-path Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 \
--host 0.0.0.0 --port 30000 --tp 2 --mem-fraction-static 0.70
```
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
```
### Function Calling Format
This model uses Qwen3 native function calling format:
```
<tool_call>{"name": "tool_name", "arguments": {"arg": "value"}}</tool_call>
```
Include `</tool_call>` in stop sequences for proper parsing.
## Training Details
- **Base model**: [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- **Training framework**: [SLIME](https://github.com/THUDM/slime) (Megatron-LM + SGLang)
- **SFT data**: [tau2-sft-seed-v3](https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3)
- **GRPO steps**: 21 optimizer steps
- **Reward shaping**: Turn-level partial scores from tau2-bench reward_info
- **User simulator (training)**: Local Qwen3-4B-Instruct on port 30001
- **User simulator (eval)**: GPT-4.1-mini via OpenAI API
### W&B Training Logs
- SFT run: [b7d80rfe](https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook/runs/b7d80rfe)
- GRPO run: [pkeu9kck](https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook/runs/pkeu9kck)
## Resources
- [Training Cookbook](https://github.com/THUDM/slime/blob/main/examples/tau-bench/training_cookbook.md) - Full methodology and reproduction steps
- [SFT Checkpoint](https://huggingface.co/Jarrodbarnes/Qwen3-4B-tau2-sft1) - Intermediate SFT+RFT checkpoint
- [Training Dataset](https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3) - Filtered RFT trajectories
- [tau2-bench](https://github.com/sierra-research/tau2-bench) - Benchmark repository
## Limitations
- **Telecom domain**: Dual-control tasks (where the agent must instruct rather than execute) remain challenging (40% Pass@4)
- **User simulator sensitivity**: Results vary with user simulator choice; GPT-4.1-mini recommended for reproducibility
- **Pass@k vs Pass^k**: This model reports pass@k (any success in k attempts), not the pass^k metric used on the official tau2-bench leaderboard
## Citation
```bibtex
@misc{qwen3-tau2-grpo,
title={Qwen3-4B-tau2-grpo-v1: Multi-Turn Tool-Use Agent via Progressive RL Training},
author={Jarrod Barnes},
year={2025},
url={https://huggingface.co/Jarrodbarnes/Qwen3-4B-tau2-grpo-v1}
}
```
## Acknowledgments
- [Qwen Team](https://github.com/QwenLM/Qwen3) for the base model
- [Sierra Research](https://github.com/sierra-research/tau2-bench) for tau2-bench
- [THUDM](https://github.com/THUDM/slime) for the SLIME training framework
|