---
tags:
- world-model
- grpo
- webshop
- functional-equivalence
base_model: X1AOX1A/WorldModel-Webshop-Qwen2.5-7B
---

# ws-wm-0218-step-10

Exp8-r2 (0218) World Model checkpoint at step 10 (epoch 0.07).

## Training

- **Method**: Pivot-GRPO with Cauchy α=1.0, BehR-only reward
- **Base**: Qwen2.5-7B SFT World Model
- **Bug Fixes**: 9 critical fixes (negative reward, signal dilution, API dedup, etc.)
- **Ckpt Monitor BehR**: See collection description for full results

## Key Config

- `reward_mode=cauchy`, `behavior_scale_coef=1.0`
- `behavior_weight=1.0`, `facts_weight=0`, `length_penalty_weight=0`
- `lr=2e-6`, `batch_size=32`, `temperature=1.2`
- `max_prompt_length=28672`, `ppo_epochs=1`
- Judge: Qwen3-8B (vLLM TP=4)

Part of collection: [ws-wm-0218](https://huggingface.co/collections/Ricardo-H/ws-wm-0218)
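
As a rough illustration of how the `reward_mode=cauchy` setting and the listed weights could combine, the sketch below maps a non-negative behavior discrepancy to a reward in (0, 1] with a Cauchy-shaped kernel. The function names, the `behavior_discrepancy` input, and the exact functional form are assumptions made for illustration; the actual training code is not shown in this card and may differ.

```python
def cauchy_reward(discrepancy: float, alpha: float = 1.0) -> float:
    """Hypothetical Cauchy-shaped kernel: 1 / (1 + (d / alpha)^2).

    Returns 1.0 at zero discrepancy and decays smoothly toward 0.
    `alpha` plays the role of the scale (assumed to match alpha=1.0 above).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / (1.0 + (discrepancy / alpha) ** 2)


def total_reward(
    behavior_discrepancy: float,
    facts_score: float = 0.0,
    length_penalty: float = 0.0,
    behavior_weight: float = 1.0,        # behavior_weight=1.0
    facts_weight: float = 0.0,           # facts_weight=0
    length_penalty_weight: float = 0.0,  # length_penalty_weight=0
    alpha: float = 1.0,
) -> float:
    """Assumed weighted sum; with the weights above only the behavior term counts."""
    behavior_term = behavior_weight * cauchy_reward(behavior_discrepancy, alpha)
    facts_term = facts_weight * facts_score
    penalty_term = length_penalty_weight * length_penalty
    return behavior_term + facts_term - penalty_term


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 2.0):
        print(f"discrepancy={d:.1f} -> reward={total_reward(d):.3f}")
```

With `facts_weight` and `length_penalty_weight` at zero, the total reduces to the behavior term alone, which is one way to read "BehR-only reward".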