Maze 17x17 - bs512 Prior Ladder
Artifacts for the GOS x base-competence RL sweep on the 17x17 maze task.
Contents
checkpoints/
Six SFT checkpoints from a single training run (bs=512, constant lr=1e-4) on random-trajectory SFT data, using a 3M-param Qwen2 model (4 layers, hidden=256). Train-set Pass@1 and Pass@4 (256 prompts x 8 gens, T=1.0):
| Checkpoint | SFT step | Pass@1 | Pass@4 |
|---|---|---|---|
| ckpt-2500 | 2500 | 0.05% | 0.20% |
| ckpt-3500 | 3500 | 0.10% | 0.39% |
| ckpt-4500 | 4500 | 0.24% | 0.98% |
| ckpt-5500 | 5500 | 1.03% | 3.93% |
| ckpt-6500 | 6500 | 3.91% | 14.73% |
| ckpt-7500 | 7500 | 8.35% | 28.75% |
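
For readers who want to reproduce numbers of this kind, here is a minimal sketch of how Pass@1 and Pass@4 could be computed from 8 generations per prompt. It assumes the standard unbiased pass@k estimator, which this README does not confirm; the function names and the `counts` list are hypothetical and used only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one prompt.

    n: generations sampled, c: generations that solved the maze, k: budget.
    Assumption: this is the estimator used; the README does not say so.
    """
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-prompt estimate over all prompts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical per-prompt success counts: 256 prompts, 8 generations each,
# with exactly one successful generation overall.
counts = [1] + [0] * 255
p1 = mean_pass_at_k(counts, n=8, k=1)
p4 = mean_pass_at_k(counts, n=8, k=4)
print(f"Pass@1 = {p1:.2%}, Pass@4 = {p4:.2%}")
```

With these hypothetical counts the script prints `Pass@1 = 0.05%, Pass@4 = 0.20%`, which is the same pair as the ckpt-2500 row. The actual per-prompt counts are not part of this README, so treat the match as a consistency check rather than a description of the evaluation code.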
The first four span the deep weak-prior regime (where continuous-reward RL matters most); the last two are mid-prior, useful for testing whether the GOS / GRPO gap narrows as base competence improves.
data/
- train_random.json: 1.3M random-trajectory SFT examples
- test.json: 256 held-out prompts (same split used by SFT eval and RL validation)
Each example is a serialized maze (symbolic grid tokens + actions) following the MaxRL Maze-17 format.
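
A quick sanity check after downloading the files might look like the sketch below. It assumes each file is a single top-level JSON array, which is not stated here, and it does not assume anything about the fields inside an example. The helper names `load_examples` and `check_split_size` are hypothetical.

```python
import json
from pathlib import Path


def load_examples(path):
    """Load a JSON file and return its examples as a list.

    Assumption: the file holds one top-level JSON array of examples.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(
            f"{path}: expected a top-level JSON array, got {type(data).__name__}"
        )
    return data


def check_split_size(examples, expected: int) -> bool:
    """Return True if the number of loaded examples equals the expected count."""
    return len(examples) == expected
```

For example, `check_split_size(load_examples("data/test.json"), 256)` should return `True` if the held-out split matches the count listed above and the array assumption holds.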