Jarrodbarnes commited on
Commit
9902d3e
·
verified ·
1 Parent(s): c32c412

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +108 -8
README.md CHANGED
@@ -12,14 +12,41 @@ tags:
12
  datasets:
13
  - Jarrodbarnes/tau2-sft-seed-v3
14
  pipeline_tag: text-generation
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
15
  ---
16
 
17
  # Qwen3-4B-tau2-grpo-v1
18
 
19
- A 4B parameter model fine-tuned for multi-turn tool-use tasks, achieving **59% Pass@4** on tau2-bench (test split). This represents a **4x improvement** over the base model.
 
 
 
 
 
 
 
 
 
20
 
21
  ## Performance
22
 
 
 
23
  | Domain | Pass@1 | Pass@4 | Tasks |
24
  |--------|--------|--------|-------|
25
  | **Overall** | **36.0%** | **59.0%** | 100 |
@@ -27,19 +54,92 @@ A 4B parameter model fine-tuned for multi-turn tool-use tasks, achieving **59% P
27
  | Retail | 55.0% | 85.0% | 40 |
28
  | Telecom | 27.5% | 40.0% | 40 |
29
 
30
- ## Training
31
 
32
- Three-stage pipeline: SFT -> RFT -> GRPO. See [training cookbook](https://github.com/THUDM/slime/blob/main/examples/tau-bench/training_cookbook.md).
 
 
 
 
 
 
33
 
34
  ## Usage
35
 
 
 
36
  ```bash
37
- python -m sglang.launch_server --model-path Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 --host 0.0.0.0 --port 30000 --tp 2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
38
  ```
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
  ## Resources
41
 
42
- - [Training Cookbook](https://github.com/THUDM/slime/blob/main/examples/tau-bench/training_cookbook.md)
43
- - [SFT Checkpoint](https://huggingface.co/Jarrodbarnes/Qwen3-4B-tau2-sft1)
44
- - [Training Dataset](https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3)
45
- - W&B: [SFT](https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook/runs/b7d80rfe), [GRPO](https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook/runs/pkeu9kck)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
12
  datasets:
13
  - Jarrodbarnes/tau2-sft-seed-v3
14
  pipeline_tag: text-generation
15
+ model-index:
16
+ - name: Qwen3-4B-tau2-grpo-v1
17
+ results:
18
+ - task:
19
+ type: multi-turn-tool-use
20
+ name: tau2-bench
21
+ dataset:
22
+ type: tau2-bench
23
+ name: tau2-bench (test split)
24
+ metrics:
25
+ - type: pass@1
26
+ value: 36.0
27
+ name: Pass@1 (Overall)
28
+ - type: pass@4
29
+ value: 59.0
30
+ name: Pass@4 (Overall)
31
  ---
32
 
33
  # Qwen3-4B-tau2-grpo-v1
34
 
35
+ A 4B parameter model fine-tuned for multi-turn tool-use tasks, achieving **59% Pass@4** on tau2-bench (test split). This represents a **4x improvement** over the base model and demonstrates that progressive training (SFT -> RFT -> GRPO) works effectively for complex, multi-turn agent tasks.
36
+
37
+ ## Model Description
38
+
39
+ This model was trained using a three-stage pipeline:
40
+ 1. **SFT (Supervised Fine-Tuning)**: Learning protocol and tool schemas from successful trajectories
41
+ 2. **RFT (Rejection Fine-Tuning)**: Concentrating training on high-quality rollouts via rejection sampling
42
+ 3. **GRPO (Group Relative Policy Optimization)**: Reinforcement learning with turn-level reward shaping
43
+
44
+ The training process is documented in the [tau2 training cookbook](https://github.com/THUDM/slime/blob/main/examples/tau-bench/training_cookbook.md).
45
 
46
  ## Performance
47
 
48
+ ### tau2-bench Test Split (Pass@4 evaluation)
49
+
50
  | Domain | Pass@1 | Pass@4 | Tasks |
51
  |--------|--------|--------|-------|
52
  | **Overall** | **36.0%** | **59.0%** | 100 |
 
54
  | Retail | 55.0% | 85.0% | 40 |
55
  | Telecom | 27.5% | 40.0% | 40 |
56
 
57
+ ### Training Progression
58
 
59
+ | Stage | Overall Pass@4 |
60
+ |-------|----------------|
61
+ | Baseline (Qwen3-4B-Instruct) | 14.3% |
62
+ | SFT + RFT | 27.0% |
63
+ | GRPO (this model) | **59.0%** |
64
+
65
+ **Eval config**: `temperature=0.8`, `top_p=1.0`, `top_k=20`, `num_samples=4`, `TAU2_USER_MODEL=gpt-4.1-mini`, `TAU2_USER_TEMPERATURE=0.7`, `TAU2_MAX_STEPS=100`.
66
 
67
  ## Usage
68
 
69
+ ### With SGLang (recommended for evaluation)
70
+
71
  ```bash
72
+ # Start the server (use --tp 1 for single GPU)
73
+ python -m sglang.launch_server \
74
+ --model-path Jarrodbarnes/Qwen3-4B-tau2-grpo-v1 \
75
+ --host 0.0.0.0 --port 30000 --tp 2 --mem-fraction-static 0.70
76
+ ```
77
+
78
+ ### With Transformers
79
+
80
+ ```python
81
+ from transformers import AutoModelForCausalLM, AutoTokenizer
82
+
83
+ model_name = "Jarrodbarnes/Qwen3-4B-tau2-grpo-v1"
84
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
85
+ model = AutoModelForCausalLM.from_pretrained(
86
+ model_name,
87
+ torch_dtype="auto",
88
+ device_map="auto"
89
+ )
90
  ```
91
 
92
+ ### Function Calling Format
93
+
94
+ This model uses Qwen3 native function calling format:
95
+
96
+ ```
97
+ <tool_call>{"name": "tool_name", "arguments": {"arg": "value"}}</tool_call>
98
+ ```
99
+
100
+ Include `</tool_call>` in stop sequences for proper parsing.
101
+
102
+ ## Training Details
103
+
104
+ - **Base model**: [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
105
+ - **Training framework**: [SLIME](https://github.com/THUDM/slime) (Megatron-LM + SGLang)
106
+ - **SFT data**: [tau2-sft-seed-v3](https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3)
107
+ - **GRPO steps**: 21 optimizer steps
108
+ - **Reward shaping**: Turn-level partial scores from tau2-bench reward_info
109
+ - **User simulator (training)**: Local Qwen3-4B-Instruct on port 30001
110
+ - **User simulator (eval)**: GPT-4.1-mini via OpenAI API
111
+
112
+ ### W&B Training Logs
113
+
114
+ - SFT run: [b7d80rfe](https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook/runs/b7d80rfe)
115
+ - GRPO run: [pkeu9kck](https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook/runs/pkeu9kck)
116
+
117
  ## Resources
118
 
119
+ - [Training Cookbook](https://github.com/THUDM/slime/blob/main/examples/tau-bench/training_cookbook.md) - Full methodology and reproduction steps
120
+ - [SFT Checkpoint](https://huggingface.co/Jarrodbarnes/Qwen3-4B-tau2-sft1) - Intermediate SFT+RFT checkpoint
121
+ - [Training Dataset](https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3) - Filtered RFT trajectories
122
+ - [tau2-bench](https://github.com/sierra-research/tau2-bench) - Benchmark repository
123
+
124
+ ## Limitations
125
+
126
+ - **Telecom domain**: Dual-control tasks (where the agent must instruct rather than execute) remain challenging (40% Pass@4)
127
+ - **User simulator sensitivity**: Results vary with user simulator choice; GPT-4.1-mini recommended for reproducibility
128
+ - **Pass@k vs Pass^k**: This model reports pass@k (any success in k attempts), not the pass^k metric used on the official tau2-bench leaderboard
129
+
130
+ ## Citation
131
+
132
+ ```bibtex
133
+ @misc{qwen3-tau2-grpo,
134
+ title={Qwen3-4B-tau2-grpo-v1: Multi-Turn Tool-Use Agent via Progressive RL Training},
135
+ author={Jarrod Barnes},
136
+ year={2025},
137
+ url={https://huggingface.co/Jarrodbarnes/Qwen3-4B-tau2-grpo-v1}
138
+ }
139
+ ```
140
+
141
+ ## Acknowledgments
142
+
143
+ - [Qwen Team](https://github.com/QwenLM/Qwen3) for the base model
144
+ - [Sierra Research](https://github.com/sierra-research/tau2-bench) for tau2-bench
145
+ - [THUDM](https://github.com/THUDM/slime) for the SLIME training framework