ZYao720 committed
Commit 186e083 · verified · 1 Parent(s): a70ec0b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +170 -6
README.md CHANGED
@@ -1,16 +1,180 @@
  ---
  license: apache-2.0
  library_name: transformers
  ---

- # Coming Soon

- This model will be released shortly. Stay tuned!

- **WebArbiter-4B-Qwen3**: Our efficient model, achieving **72.55% Avg. BoN Accuracy** on WebPRMBench.

- **Paper**: [WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents](https://arxiv.org/abs/2601.21872)

- **Code**: [GitHub](https://github.com/YaoZhang720/WebArbiter)

- **Website**: [yaozhang.ai/WebArbiter](https://yaozhang.ai/WebArbiter/)

  ---
+ language:
+ - en
  license: apache-2.0
  library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - web-agent
+ - process-reward-model
+ - preference
+ - reward-model
+ - web-navigation
+ - reasoning
+ - grpo
+ base_model: Qwen/Qwen3-4B
+ datasets:
+ - ZYao720/WebArbiter-Data
+ model-index:
+ - name: WebArbiter-4B-Qwen3
+   results:
+   - task:
+       type: text-generation
+       name: Web Process Reward Modeling
+     dataset:
+       name: WebPRMBench
+       type: ZYao720/WEBPRMBENCH
+     metrics:
+     - name: Avg Pairwise Accuracy
+       type: accuracy
+       value: 87.73
+     - name: Avg BoN Accuracy
+       type: accuracy
+       value: 72.55
  ---

+ <div align="center">

+ # WebArbiter-4B-Qwen3

+ **A principle-guided reasoning Process Reward Model for web agents**

+ **Published at ICLR 2026**

+ [Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)

+ </div>
+
+ ## Introduction
+
+ **WebArbiter-4B-Qwen3** is a 4B-parameter reasoning Process Reward Model (PRM) for web agents, built on [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B). It demonstrates that stronger base models amplify the benefits of principle-guided reasoning distillation, achieving an **Avg. BoN Acc of 72.55%** with roughly half the parameters of WebArbiter-7B (Qwen2.5), which scores 74.60%.
+
+ Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation: it produces interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.
+
+ ## Highlights
+
+ - **Parameter-efficient**: Approaches WebArbiter-7B (Qwen2.5) performance (72.55 vs 74.60 Avg. BoN Acc) with roughly half the parameters.
+ - **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains.
+ - **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state.
+ - **Two-stage training**: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO).
+ - **Cross-backbone generalization**: Same training pipeline as Qwen2.5 variants; only backbone-specific hyperparameters differ.
+
+ ## Results on WebPRMBench
+
+ Models marked with ⋆ are ours. **Bold** = best at comparable scale.
+
+ | Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
+ |-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+ | | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
+ | *Proprietary LLM-as-judge* | | | | | | | | | | |
+ | GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
+ | GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
+ | *WebPRMs (3-4B)* | | | | | | | | | | |
+ | WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | 46.67 | 50.00 | 21.23 | 68.08 | 43.60 |
+ | ⋆ WebArbiter-3B (Qwen2.5) | 93.32 | 78.42 | 81.97 | 56.22 | 78.33 | 46.67 | 81.01 | 54.81 | 83.65 | 59.06 |
+ | ⋆ **WebArbiter-4B (Qwen3)** | **98.55** | **94.73** | **83.21** | **61.19** | **92.50** | **83.33** | 76.68 | 50.96 | **87.73** | **72.55** |
+
+ WebArbiter-4B (Qwen3) outperforms WebArbiter-3B (Qwen2.5) on Mind2Web, WebArena, and AssistantBench, but trails it on WorkArena (76.68 vs 81.01 Pair, 50.96 vs 54.81 BoN). On average, BoN Acc improves from 59.06% to 72.55%, approaching WebArbiter-7B (Qwen2.5) at 74.60% with roughly half the parameters.
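
The Avg. columns look like an unweighted mean over the four environments. The README does not state this, so the check below treats it as an assumption. It recomputes the 4B row from the table and lists any environment where the 4B BoN score falls below the 3B one.

```python
# Scores copied from the table above; the averaging rule is an assumption.
pair_4b = {"Mind2Web": 98.55, "WebArena": 83.21, "AssistantBench": 92.50, "WorkArena": 76.68}
bon_4b = {"Mind2Web": 94.73, "WebArena": 61.19, "AssistantBench": 83.33, "WorkArena": 50.96}
bon_3b = {"Mind2Web": 78.42, "WebArena": 56.22, "AssistantBench": 46.67, "WorkArena": 54.81}


def macro_avg(scores):
    """Unweighted mean over environments."""
    return sum(scores.values()) / len(scores)


avg_pair = macro_avg(pair_4b)
avg_bon = macro_avg(bon_4b)

# Environments where the 4B model scores lower than the 3B model on BoN.
regressions = [env for env in bon_4b if bon_4b[env] < bon_3b[env]]

print(f"Avg Pair: {avg_pair:.3f}, Avg BoN: {avg_bon:.3f}")
print("4B below 3B on BoN:", regressions)
```

For this row the recomputed means agree with the reported 87.73 and 72.55 to within rounding.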
+
+ ## Quick Start
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "ZYao720/WebArbiter-4B-Qwen3"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Construct your prompt following the WebPRMBench format.
+ # See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
+ user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses
+
+ messages = [{"role": "user", "content": user_prompt}]
+ input_ids = tokenizer.apply_chat_template(
+     messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
+ ).to(model.device)
+
+ # Greedy decoding keeps the judgment deterministic.
+ with torch.no_grad():
+     output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=False)
+
+ # Decode only the newly generated tokens, skipping the prompt.
+ response = tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
+ print(response)
+ ```
+
+ **Example output:**
+ ```xml
+ <State>The user is on the DuckDuckGo homepage with a search box visible.
+ Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
+ <Criteria>1. Goal alignment (weight 0.6): Does the action advance the search task?
+ 2. Element reference accuracy (weight 0.25): Is the referenced element correct?
+ 3. Efficiency (weight 0.15): Does the action avoid unnecessary steps?</Criteria>
+ <Analysis>Response 1 directly fills the search query into the textbox, which is the
+ most direct path to completing the search task. Response 2 clicks an irrelevant link
+ that does not contribute to the search goal.</Analysis>
+ <Answer>Response 1</Answer>
+ ```
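
To use the verdict programmatically, the tagged sections have to be pulled out of the generated text. The sketch below is one way to do that with regular expressions. The helper names `parse_judgment` and `preferred_index` are hypothetical, and it assumes the output follows the four-tag layout shown in the example, with the answer written as `Response 1` or `Response 2`.

```python
import re

# Tag names taken from the example output above.
TAGS = ("State", "Criteria", "Analysis", "Answer")


def parse_judgment(text):
    """Map each tag to its stripped content, or None when the tag is missing."""
    fields = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


def preferred_index(text):
    """Return 0 for 'Response 1', 1 for 'Response 2', None if unparseable."""
    answer = parse_judgment(text)["Answer"]
    if answer is None:
        return None
    match = re.fullmatch(r"Response\s*([12])", answer, flags=re.IGNORECASE)
    return int(match.group(1)) - 1 if match else None


sample = (
    "<State>s</State>\n<Criteria>c</Criteria>\n"
    "<Analysis>line one\nline two</Analysis>\n<Answer>Response 1</Answer>"
)
print(preferred_index(sample))
```

Returning `None` for malformed generations lets the caller decide whether to retry or skip the sample.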
+
+ ## Training Details
+
+ | | Stage 1: Reasoning Distillation | Stage 2: RLVR |
+ |---|---|---|
+ | Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
+ | Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
+ | Teacher | o3 | N/A |
+ | Base Model | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | Stage 1 checkpoint |
+ | Fine-tuning | LoRA | FSDP + LoRA |
+ | Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
+ | Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
+ | Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |
+
+ All variants use the same training data, distillation strategy, and RL procedure; only backbone-specific hyperparameters differ. See the [paper](https://arxiv.org/abs/2601.21872) (Appendix C) for full details.
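
As a rough illustration of "GRPO with binary verifiable rewards", the sketch below scores each sampled judgment 1 or 0 by comparing its `<Answer>` to the gold preference, then normalizes rewards within the sampling group. The function names, the exact-match rule, the use of population standard deviation, and `eps` are all assumptions made for illustration. They are not taken from the authors' veRL configuration.

```python
import re
from statistics import mean, pstdev


def binary_reward(generation, gold):
    """1.0 if the <Answer> content equals the gold label, else 0.0 (assumed rule)."""
    match = re.search(r"<Answer>\s*(.*?)\s*</Answer>", generation, flags=re.DOTALL)
    return 1.0 if match and match.group(1) == gold else 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center by the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical samples for one prompt whose gold label is "Response 1".
group = [
    "<Answer>Response 1</Answer>",
    "<Answer>Response 2</Answer>",
    "output without a verdict",
    "<Answer>Response 1</Answer>",
]
rewards = [binary_reward(g, "Response 1") for g in group]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.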
+
+ ## Intended Uses
+
+ WebArbiter-4B-Qwen3 is designed to:
+ - **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
+ - **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
+ - **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred.
+ - **Deploy efficiently**: Deliver strong performance at 4B parameters, approaching 7B-level accuracy with roughly half the parameters.
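
Because the model compares two actions at a time, Best-of-N selection needs some scheme for reducing N candidates to one. The README does not say which scheme is used. The sketch below shows one simple option, a sequential knockout that costs N-1 comparisons. `prefer` is a hypothetical callable wrapping a model call, returning 0 if the first action is preferred and 1 otherwise.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], prefer: Callable[[T, T], int]) -> T:
    """Keep a running winner and compare it against each remaining candidate."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    best = candidates[0]
    for challenger in candidates[1:]:
        # prefer() returns 1 when the second argument wins the comparison.
        if prefer(best, challenger) == 1:
            best = challenger
    return best


# Toy stand-in for the reward model: prefers the longer action string.
def toy_prefer(a: str, b: str) -> int:
    return 1 if len(b) > len(a) else 0


actions = ["click [3]", "fill [1] 'weather in Munich'", "scroll down"]
chosen = best_of_n(actions, toy_prefer)
print(chosen)
```

Pairwise judgments can depend on presentation order, so a real setup might query both orderings per pair, at twice the cost.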
+
+ ## Limitations
+
+ - **Text-only observations**: Relies on accessibility tree representations without visual observations.
+ - **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
+ - **Safe-action bias**: May sometimes overvalue cautious actions because the accessibility tree does not encode interaction effects.
+
+ ## License
+
+ This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B).
+
+ ## Related Resources
+
+ | Resource | Link |
+ |----------|------|
+ | WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
+ | WebArbiter-7B (Qwen2.5) | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
+ | WebArbiter-3B (Qwen2.5) | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
+ | WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
+ | Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
+ | Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |
+
+ ## Citation
+
+ ```bibtex
+ @misc{zhang2026ZYao720principleguidedreasoningprocess,
+   title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
+   author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
+   year={2026},
+   eprint={2601.21872},
+   archivePrefix={arXiv},
+   primaryClass={cs.AI},
+   url={https://arxiv.org/abs/2601.21872},
+ }
+ ```