Instructions to use ZYao720/WebArbiter-4B-Qwen3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ZYao720/WebArbiter-4B-Qwen3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ZYao720/WebArbiter-4B-Qwen3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ZYao720/WebArbiter-4B-Qwen3")
# The model is a text-only causal LM built on Qwen3-4B, so load it with the causal-LM class.
model = AutoModelForCausalLM.from_pretrained("ZYao720/WebArbiter-4B-Qwen3")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ZYao720/WebArbiter-4B-Qwen3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ZYao720/WebArbiter-4B-Qwen3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ZYao720/WebArbiter-4B-Qwen3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
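
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is reachable at `http://localhost:8000` and returns the usual OpenAI-style chat completion body; the helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "ZYao720/WebArbiter-4B-Qwen3",
    base_url: str = "http://localhost:8000",  # assumed local server address
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-style responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` mirrors the curl call above. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead.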

Use Docker

docker model run hf.co/ZYao720/WebArbiter-4B-Qwen3

SGLang

How to use ZYao720/WebArbiter-4B-Qwen3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ZYao720/WebArbiter-4B-Qwen3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ZYao720/WebArbiter-4B-Qwen3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ZYao720/WebArbiter-4B-Qwen3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ZYao720/WebArbiter-4B-Qwen3",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ZYao720/WebArbiter-4B-Qwen3 with Docker Model Runner:
```
docker model run hf.co/ZYao720/WebArbiter-4B-Qwen3
```

ZYao720 committed on Apr 9

Commit 186e083 (verified) · 1 Parent(s): a70ec0b

Upload README.md with huggingface_hub

Files changed (1)

README.md +170 -6


@@ -1,16 +1,180 @@
 ---
 license: apache-2.0
 library_name: transformers
 ---
-# Coming Soon
-This model will be released shortly. Stay tuned!
-**WebArbiter-4B-Qwen3** — Our efficient model, achieving **72.55% Avg. BoN Accuracy** on WebPRMBench.
-**Paper**: [WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents](https://arxiv.org/abs/2601.21872)
-**Code**: [GitHub](https://github.com/YaoZhang720/WebArbiter)
-**Website**: [yaozhang.ai/WebArbiter](https://yaozhang.ai/WebArbiter/)

 ---
+language:
+- en
 license: apache-2.0
 library_name: transformers
+pipeline_tag: text-generation
+tags:
+- web-agent
+- process-reward-model
+- preference
+- reward-model
+- web-navigation
+- reasoning
+- grpo
+base_model: Qwen/Qwen3-4B
+datasets:
+- ZYao720/WebArbiter-Data
+model-index:
+- name: WebArbiter-4B-Qwen3
+  results:
+  - task:
+      type: text-generation
+      name: Web Process Reward Modeling
+    dataset:
+      name: WebPRMBench
+      type: ZYao720/WEBPRMBENCH
+    metrics:
+    - name: Avg Pairwise Accuracy
+      type: accuracy
+      value: 87.73
+    - name: Avg BoN Accuracy
+      type: accuracy
+      value: 72.55
 ---
+<div align="center">
+# WebArbiter-4B-Qwen3
+**A principle-guided reasoning Process Reward Model for web agents**
+**Published at ICLR 2026**
+[Paper](https://arxiv.org/abs/2601.21872) | [Code](https://github.com/YaoZhang720/WebArbiter) | [Website](https://yaozhang.ai/WebArbiter/) | [Collection](https://huggingface.co/collections/ZYao720/ZYao720-69cd5263871b22e11d90f80f) | [Demo](https://yaozhang.ai/WebArbiter/demo.html)
+</div>
+## Introduction
+**WebArbiter-4B-Qwen3** is a 4B reasoning Process Reward Model (PRM) for web agents, built on [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B). It demonstrates that stronger base models amplify the benefits of principle-guided reasoning distillation — achieving an **Avg. BoN Acc of 72.55%** with roughly half the parameters of WebArbiter-7B (Qwen2.5), which scores 74.60%.
+Unlike scalar or checklist-based reward models, WebArbiter formulates step-level reward modeling as structured text generation — producing interpretable, principle-inducing justifications that conclude with a preference verdict identifying the action most conducive to task completion.
+## Highlights
+- **Parameter-efficient**: Approaches WebArbiter-7B (Qwen2.5) performance (72.55 vs 74.60 Avg. BoN Acc) with roughly half the parameters.
+- **Reasoning as reward**: Generates structured `<State>`, `<Criteria>`, `<Analysis>`, and `<Answer>` outputs with auditable reasoning chains.
+- **Principle-inducing evaluation**: Dynamically derives evaluation principles from user intent and page state.
+- **Two-stage training**: Reasoning distillation from o3 (SFT) followed by RL with Verifiable Rewards (GRPO).
+- **Cross-backbone generalization**: Same training pipeline as Qwen2.5 variants; only backbone-specific hyperparameters differ.
+## Results on WebPRMBench
+Models marked with ⋆ are ours. **Bold** = best at comparable scale.
+| Model | Mind2Web | | WebArena | | AssistantBench | | WorkArena | | Avg. | |
+|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN | Pair | BoN |
+| *Proprietary LLM-as-judge* | | | | | | | | | | |
+| GPT-4o | 79.99 | 52.62 | 84.58 | 66.67 | 85.83 | 66.67 | 84.33 | 55.19 | 83.68 | 60.29 |
+| GPT-5 | 80.86 | 62.39 | 84.83 | 71.64 | 81.67 | 63.33 | 81.14 | 64.62 | 82.13 | 65.50 |
+| *WebPRMs (3~4B)* | | | | | | | | | | |
+| WebShepherd-3B | 87.50 | 65.21 | 68.16 | 41.29 | 66.67 | 46.67 | 50.00 | 21.23 | 68.08 | 43.60 |
+| ⋆ WebArbiter-3B (Qwen2.5) | 93.32 | 78.42 | 81.97 | 56.22 | 78.33 | 46.67 | **81.01** | **54.81** | 83.65 | 59.06 |
+| ⋆ **WebArbiter-4B (Qwen3)** | **98.55** | **94.73** | **83.21** | **61.19** | **92.50** | **83.33** | 76.68 | 50.96 | **87.73** | **72.55** |
+WebArbiter-4B (Qwen3) outperforms WebArbiter-3B (Qwen2.5) on Mind2Web, WebArena, and AssistantBench, improving Avg. BoN Acc from 59.06% to 72.55%, though it trails on WorkArena (76.68 vs 81.01 Pair, 50.96 vs 54.81 BoN). The average approaches WebArbiter-7B (Qwen2.5) at 74.60% with roughly half the parameters.
+## Quick Start
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "ZYao720/WebArbiter-4B-Qwen3"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+# Construct your prompt following the WebPRMBench format.
+# See https://huggingface.co/datasets/ZYao720/WEBPRMBENCH for examples.
+user_prompt = "..."  # evaluation prompt with intent, AXTree, trajectory, two responses
+messages = [{"role": "user", "content": user_prompt}]
+inputs = tokenizer.apply_chat_template(
+    messages, tokenize=True, add_generation_prompt=True,
+    return_dict=True, return_tensors="pt",
+).to(model.device)
+with torch.no_grad():
+    # Greedy decoding keeps the verdict deterministic.
+    output = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
+# Decode only the newly generated tokens, skipping the prompt.
+prompt_len = inputs["input_ids"].shape[-1]
+response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
+print(response)
+```
+**Example output:**
+```xml
+<State>The user is on the DuckDuckGo homepage with a search box visible.
+Relevant AXTree elements: [1] textbox 'Search', [2] button 'Search'.</State>
+<Criteria>1. Goal alignment (weight 0.6) — Does the action advance the search task?
+2. Element reference accuracy (weight 0.25) — Is the referenced element correct?
+3. Efficiency (weight 0.15) — Does the action avoid unnecessary steps?</Criteria>
+<Analysis>Response 1 directly fills the search query into the textbox, which is the
+most direct path to completing the search task. Response 2 clicks an irrelevant link
+that does not contribute to the search goal.</Analysis>
+<Answer>Response 1</Answer>
+```
+## Training Details
+| | Stage 1: Reasoning Distillation | Stage 2: RLVR |
+|---|---|---|
+| Method | Supervised fine-tuning (SFT) | GRPO with binary verifiable rewards |
+| Data | 9,642 teacher-distilled examples | 18,921 preference pairs |
+| Teacher | o3 | — |
+| Base Model | [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | Stage 1 checkpoint |
+| Fine-tuning | LoRA | FSDP + LoRA |
+| Framework | [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) | [veRL](https://github.com/volcengine/verl) |
+| Hardware | 8 × NVIDIA A100-80GB | 8 × NVIDIA A100-80GB |
+| Source Data | [WebPRM Collection](https://huggingface.co/datasets/LangAGI-Lab/WebPRMCollection_preference_pair) (~30k step-level preference pairs from Mind2Web) |
+All variants use the same training data, distillation strategy, and RL procedure; only backbone-specific hyperparameters differ. See the [paper](https://arxiv.org/abs/2601.21872) (Appendix C) for full details.
+## Intended Uses
+WebArbiter-4B-Qwen3 is designed to:
+- **Evaluate web agent actions**: Given a web state and two candidate actions, determine which better advances the user's task.
+- **Guide trajectory search**: Serve as a reward signal for Best-of-N sampling or tree search during web agent execution.
+- **Provide interpretable feedback**: Generate structured justifications explaining why one action is preferred.
+- **Deploy resource-efficiently**: Deliver strong performance at 4B parameters, approaching 7B-level accuracy with roughly half the parameters.
+## Limitations
+- **Text-only observations**: Relies on accessibility tree representations without visual observations.
+- **English-only**: Training and evaluation are conducted exclusively in English-language web environments.
+- **Safe-action bias**: May sometimes overvalue cautious actions because the accessibility tree does not encode interaction effects.
+## License
+This model is released under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), following the base model [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B).
+## Related Resources
+| Resource | Link |
+|----------|------|
+| WebArbiter-8B-Qwen3 (strongest) | [ZYao720/WebArbiter-8B-Qwen3](https://huggingface.co/ZYao720/WebArbiter-8B-Qwen3) |
+| WebArbiter-7B (Qwen2.5) | [ZYao720/WebArbiter-7B](https://huggingface.co/ZYao720/WebArbiter-7B) |
+| WebArbiter-3B (Qwen2.5) | [ZYao720/WebArbiter-3B](https://huggingface.co/ZYao720/WebArbiter-3B) |
+| WEBPRMBENCH (benchmark) | [ZYao720/WEBPRMBENCH](https://huggingface.co/datasets/ZYao720/WEBPRMBENCH) |
+| Training Data | [ZYao720/WebArbiter-Data](https://huggingface.co/datasets/ZYao720/WebArbiter-Data) |
+| Search Trajectories | [ZYao720/WebArbiter-Trajectories](https://huggingface.co/datasets/ZYao720/WebArbiter-Trajectories) |
+## Citation
+```bibtex
+@misc{zhang2026ZYao720principleguidedreasoningprocess,
+      title={WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents},
+      author={Yao Zhang and Shijie Tang and Zeyu Li and Zhen Han and Volker Tresp},
+      year={2026},
+      eprint={2601.21872},
+      archivePrefix={arXiv},
+      primaryClass={cs.AI},
+      url={https://arxiv.org/abs/2601.21872},
+}
+```
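
The README's example output ends with an `<Answer>` tag, and its intended uses include serving as a reward signal for Best-of-N sampling. The sketch below shows one way to consume that verdict, assuming the answer always reads `Response 1` or `Response 2` as in the example. The names `parse_verdict`, `best_of_n`, and the `judge` callable are hypothetical, and the sequential knockout is only one simple way to reduce pairwise verdicts to a single pick, not necessarily the procedure used in the paper.

```python
import re
from typing import Callable, Optional, Sequence

# Matches the verdict format shown in the README's example output.
ANSWER_RE = re.compile(r"<Answer>\s*Response\s*([12])\s*</Answer>", re.IGNORECASE)


def parse_verdict(response: str) -> Optional[int]:
    """Return 1 or 2 for the preferred response, or None if no verdict is found."""
    matches = ANSWER_RE.findall(response)
    # Take the last match in case the model restates its answer.
    return int(matches[-1]) if matches else None


def best_of_n(candidates: Sequence[str], judge: Callable[[str, str], str]) -> int:
    """Pick a candidate index by sequential knockout over pairwise verdicts.

    `judge(a, b)` is a hypothetical callable that builds the evaluation prompt
    with `a` as Response 1 and `b` as Response 2, runs the model, and returns
    its raw text output.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    best = 0
    for challenger in range(1, len(candidates)):
        verdict = parse_verdict(judge(candidates[best], candidates[challenger]))
        if verdict == 2:
            best = challenger
        # A verdict of 1, or an unparseable output, keeps the current winner.
    return best
```

This uses N - 1 model calls for N candidates. Because an unparseable output keeps the incumbent, earlier candidates get a slight advantage; a caller that cares about this could retry or randomize the order.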