---
tags:
  - world-model
  - grpo
  - webshop
  - functional-equivalence
base_model: X1AOX1A/WorldModel-Webshop-Qwen2.5-7B
---

# ws-wm-0218-step-10

World Model checkpoint from run Exp8-r2 (0218), saved at training step 10 (epoch 0.07).

## Training
- **Method**: Pivot-GRPO with Cauchy α=1.0, BehR-only reward
- **Base**: Qwen2.5-7B SFT World Model (`X1AOX1A/WorldModel-Webshop-Qwen2.5-7B`)
- **Bug Fixes**: 9 critical fixes applied (negative reward, signal dilution, API deduplication, etc.)
- **Checkpoint Monitor BehR**: see the collection description for full results
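
For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as follows. This shows only the standard GRPO normalization. The Pivot-specific modifications are not described in this card, and the function name `group_relative_advantages` and the `eps` value are illustrative assumptions rather than the training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: normalize rewards within one group.

    `rewards` holds the scalar rewards of all rollouts sampled for the same
    prompt. `eps` (an assumed value) guards against division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so the update depends on relative rather than absolute reward.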

## Key Config
- `reward_mode=cauchy`, `behavior_scale_coef=1.0`
- `behavior_weight=1.0`, `facts_weight=0`, `length_penalty_weight=0`
- `lr=2e-6`, `batch_size=32`, `temperature=1.2`
- `max_prompt_length=28672`, `ppo_epochs=1`
- Judge: Qwen3-8B, served with vLLM (tensor parallel size 4)
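
As a rough illustration of how the reward settings could interact, the sketch below gathers them into a dataclass and combines them into one scalar. The Cauchy form `1 / (1 + (d / scale)^2)`, the use of `behavior_scale_coef` as the Cauchy scale α, the linear weighted sum, and the names `RewardConfig`, `cauchy_reward`, `total_reward`, and `behavior_distance` are all assumptions made for illustration. The card does not spell out the actual reward code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardConfig:
    # Values copied from the Key Config list.
    reward_mode: str = "cauchy"
    behavior_scale_coef: float = 1.0
    behavior_weight: float = 1.0
    facts_weight: float = 0.0
    length_penalty_weight: float = 0.0


def cauchy_reward(distance: float, scale: float) -> float:
    """Assumed Cauchy-kernel shaping: 1 at distance 0, decaying toward 0."""
    return 1.0 / (1.0 + (distance / scale) ** 2)


def total_reward(cfg, behavior_distance, facts_score=0.0, length_penalty=0.0):
    """Assumed linear combination of the three reward terms."""
    behavior = cauchy_reward(behavior_distance, cfg.behavior_scale_coef)
    return (
        cfg.behavior_weight * behavior
        + cfg.facts_weight * facts_score
        - cfg.length_penalty_weight * length_penalty
    )


cfg = RewardConfig()
# Hypothetical inputs: only the behavior term should affect the result.
reward = total_reward(cfg, behavior_distance=1.0, facts_score=0.9, length_penalty=5.0)
```

With `facts_weight=0` and `length_penalty_weight=0`, the facts and length terms drop out and only the behavior term contributes, which is consistent with the BehR-only reward described under Training.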

Part of collection: [ws-wm-0218](https://huggingface.co/collections/Ricardo-H/ws-wm-0218)