---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- agent-rl
- alfworld
- archived
- ocar
- research-post-mortem
---

# ocar-v3-alfworld-7b: Archived Checkpoints

> ⚠️ **Research line terminated (2026-04-22).** These checkpoints are retained for
> inference / analysis reproducibility only. See the
> [post-mortem document](https://github.com/ymguan/verl-agent/blob/master/ocar/docs/POSTMORTEM_SURPRISE.md)
> for why we do not recommend building on this method.

## What this is

Fine-tuned from `Qwen/Qwen2.5-7B-Instruct` on **ALFWorld** with **OCAR v3 (Δs-based credit, adaptive τ)** (verl-agent stack), as part of the OCAR (Observation-grounded Credit Advantage Redistribution) research line investigating free policy-forward-pass signals for agent RL credit assignment.

## Checkpoints (per-step revisions)

Each training step is stored as a separate git branch / revision. Load a specific step via `revision=`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Ricardo-H/ocar-v3-alfworld-7b",
    revision="step_150",
    torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained("Ricardo-H/ocar-v3-alfworld-7b", revision="step_150")
```

Available revisions: `step_50`, `step_75`, `step_100`, `step_125`, `step_150`

## Results summary

See `ocar/docs/POSTMORTEM_SURPRISE.md` in the companion repo for full results.

Key points:

- 6-seed peak SR (ALFWorld paper-config, t=0.4): around 80%; **did not match GiGPO 90.8**
- Δs signal shown to be causally circular (reads back GRPO's own updates)
- Step-level AUC ≈ 0.5 across 4 heterogeneous base scorers
- Cross-environment direction flip on WebShop (r(Δs, succ): −0.53 ↔ +0.65)

## Companion resources

- Code & analysis:
- Training trajectories: `data/trajectories/` in companion repo
- Analysis JSONs: `ocar/analysis_results/` in companion repo
- Post-mortem: [`ocar/docs/POSTMORTEM_SURPRISE.md`](https://github.com/ymguan/verl-agent/blob/master/ocar/docs/POSTMORTEM_SURPRISE.md)

## Citation / attribution

These artifacts are shared in an "as-is" state. If you find the negative results useful, please reference the post-mortem document.
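
## Note on the step-level AUC figure

For readers less familiar with the metric, the sketch below shows why a step-level AUC of about 0.5 in the results summary means the signal ranks steps no better than chance. It uses the standard pairwise-ranking definition of AUC; the exact evaluation protocol is in the post-mortem, and this card does not state it. The function name `rank_auc`, the binary step labels, and the toy numbers are all assumptions made for illustration, not data from the OCAR runs.

```python
def rank_auc(scores, labels):
    """Pairwise-ranking AUC: the fraction of (positive, negative) pairs in
    which the positive item gets the higher score. Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration (labels: 1 = "good" step, 0 = "bad" step).
labels = [1, 1, 0, 0]
separating_scores = [0.9, 0.8, 0.2, 0.1]    # every good step outranks every bad one
uninformative_scores = [0.9, 0.1, 0.8, 0.2]  # good steps win only half the pairs

print(rank_auc(separating_scores, labels))     # 1.0
print(rank_auc(uninformative_scores, labels))  # 0.5
```

A scorer that lands near the second case carries no usable ordering over steps, which is the situation the key points describe.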