	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-7B-Instruct
	tags:
	- agent-rl
	- alfworld
	- archived
	- ocar
	- research-post-mortem
	---

	# ocar-v3-alfworld-7b — Archived Checkpoints

	> ⚠️ Research line terminated (2026-04-22). These checkpoints are retained for
	> inference / analysis reproducibility only. See the
	> [post-mortem document](https://github.com/ymguan/verl-agent/blob/master/ocar/docs/POSTMORTEM_SURPRISE.md)
	> for why we do not recommend building on this method.

	## What this is

	Fine-tuned from `Qwen/Qwen2.5-7B-Instruct` on ALFWorld with OCAR v3
	(Δs-based credit, adaptive τ), trained on the verl-agent stack. The model
	belongs to the OCAR (Observation-grounded Credit Advantage Redistribution)
	research line, which investigated free policy-forward-pass signals for
	agent RL credit assignment.

	## Checkpoints (per-step revisions)

	Each archived training step is stored as a separate git branch (revision).
	Load a specific step by passing `revision=`:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained(
	    "Ricardo-H/ocar-v3-alfworld-7b", revision="step_150", torch_dtype="bfloat16"
	)
	tokenizer = AutoTokenizer.from_pretrained("Ricardo-H/ocar-v3-alfworld-7b", revision="step_150")
	```

	Available revisions: `step_50`, `step_75`, `step_100`, `step_125`, `step_150`
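
To compare several steps, it can help to enumerate the step revisions programmatically rather than hard-coding them. The following is a minimal sketch, not part of the original release: the helper names (`step_revisions`, `list_step_revisions`, `load_checkpoint`) are hypothetical, and it assumes `huggingface_hub` and `transformers` are installed. Only the final `print` runs offline; the other two functions need network access and enough memory for a 7B model.

```python
import re

REPO_ID = "Ricardo-H/ocar-v3-alfworld-7b"
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def step_revisions(branch_names):
    """Return the names that look like `step_<N>`, sorted by N ascending."""
    steps = []
    for name in branch_names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def list_step_revisions(repo_id=REPO_ID):
    """Query the Hub for branch names and keep only step revisions (needs network)."""
    # Lazy import so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return step_revisions(branch.name for branch in refs.branches)


def load_checkpoint(revision, repo_id=REPO_ID):
    """Load model and tokenizer for one revision, mirroring the snippet above."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        repo_id, revision=revision, torch_dtype="bfloat16"
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    return model, tokenizer


# Offline check of the filtering logic, using the revision names listed on this card.
print(step_revisions(["main", "step_150", "step_50", "step_100", "step_75", "step_125"]))
# → ['step_50', 'step_75', 'step_100', 'step_125', 'step_150']
```

Sorting on the parsed integer rather than the raw string keeps `step_50` ahead of `step_100`, which a plain lexicographic sort would not.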

	## Results summary

	See `ocar/docs/POSTMORTEM_SURPRISE.md` in the companion repo for full results.
	Key points:

	- 6-seed peak success rate (SR; ALFWorld paper config, t=0.4): around 80%, short of GiGPO's 90.8
	- Δs signal shown to be causally circular (reads back GRPO's own updates)
	- Step-level AUC ≈ 0.5 across 4 heterogeneous base scorers
	- Cross-environment direction flip on WebShop (r(Δs, succ): −0.53 ↔ +0.65)
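
For readers less familiar with the two statistics quoted in the list above, the sketch below shows one common way to compute a ranking AUC and a Pearson correlation between a per-item signal and a binary outcome. It is illustrative only: the function names are hypothetical, the toy numbers are invented and unrelated to the post-mortem's data, and the authors' actual analysis scripts (in the companion repo) may define these quantities differently. An AUC near 0.5 means the signal ranks positives above negatives no better than chance.

```python
import math


def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy values: a per-item signal and a binary success label.
toy_signal = [0.1, 0.4, 0.35, 0.8]
toy_success = [0, 0, 1, 1]

print(pairwise_auc(toy_signal, toy_success))  # → 0.75
print(pearson_r(toy_signal, toy_success))
```

A sign flip in the correlation between two environments, as reported for WebShop, means the same signal points in opposite directions relative to success, so no single fixed sign convention can serve both.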

	## Companion resources

	- Code & analysis: <https://github.com/ymguan/verl-agent>
	- Training trajectories: `data/trajectories/` in companion repo
	- Analysis JSONs: `ocar/analysis_results/` in companion repo
	- Post-mortem: [`ocar/docs/POSTMORTEM_SURPRISE.md`](https://github.com/ymguan/verl-agent/blob/master/ocar/docs/POSTMORTEM_SURPRISE.md)

	## Citation / attribution

	These artifacts are shared in an "as-is" state. If you find the negative
	results useful, please reference the post-mortem document.