Maze 17x17 - bs512 Prior Ladder

Artifacts for the GOS x base-competence RL sweep on the 17x17 maze task.

`checkpoints/`

Six SFT checkpoints from a single training run (bs=512, constant lr=1e-4) on random-trajectory SFT data, using a 3M-parameter Qwen2 model (4 layers, hidden=256). Train-set Pass@1 and Pass@4 (256 prompts x 8 generations, T=1.0):

| Checkpoint | SFT step | Pass@1 | Pass@4 |
|---|---|---|---|
| `ckpt-2500` | 2500 | 0.05% | 0.20% |
| `ckpt-3500` | 3500 | 0.10% | 0.39% |
| `ckpt-4500` | 4500 | 0.24% | 0.98% |
| `ckpt-5500` | 5500 | 1.03% | 3.93% |
| `ckpt-6500` | 6500 | 3.91% | 14.73% |
| `ckpt-7500` | 7500 | 8.35% | 28.75% |
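For reading the Pass@k columns, the following is a minimal sketch of the standard unbiased pass@k estimator, averaged over prompts. That this is the estimator used to produce the table is an assumption, not something the card states; the names `pass_at_k` and `mean_pass_at_k` and the example outcome are hypothetical. With 256 prompts x 8 generations there are only 2048 samples, so a single correct generation already yields the numbers in the `ckpt-2500` row.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one prompt: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # 1 - P(all k samples drawn without replacement are incorrect)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-prompt estimate over all prompts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical outcome: 256 prompts, 8 generations each,
# exactly one correct generation across the whole evaluation.
counts = [1] + [0] * 255
print(f"{mean_pass_at_k(counts, 8, 1):.2%}")  # 0.05%
print(f"{mean_pass_at_k(counts, 8, 4):.2%}")  # 0.20%
```

Under the same assumption, two and five correct generations on distinct prompts give 0.10% / 0.39% and 0.24% / 0.98%, so the weakest checkpoints rest on single-digit success counts and small differences between them should be read with that in mind.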

The first four (Pass@1 of about 1% or less) span the deep weak-prior regime, where continuous-reward RL matters most; the last two (Pass@1 of about 4% and 8%) are mid-prior, useful for testing whether the GOS/GRPO gap narrows as base competence improves.

`data/`

- `train_random.json` - 1.3M random-trajectory SFT examples
- `test.json` - 256 held-out prompts (same split used by SFT eval and RL validation)

Each example is a serialized maze (symbolic grid tokens + actions) following the MaxRL Maze-17 format.

Reproduction

Code: https://github.com/stablegradients/GOS-17X17-Maze
