Maze 17x17 - bs512 Prior Ladder

Artifacts for the GOS x base-competence RL sweep on the 17x17 maze task.

Contents

checkpoints/

Six SFT checkpoints from a single training run (bs=512, constant lr=1e-4) on random-trajectory SFT data, using a 3M-param Qwen2 model (4 layers, hidden=256). Trainset Pass@1 and Pass@4 (256 prompts x 8 gens, T=1.0):

Checkpoint   SFT step   Pass@1   Pass@4
ckpt-2500    2500       0.05%    0.20%
ckpt-3500    3500       0.10%    0.39%
ckpt-4500    4500       0.24%    0.98%
ckpt-5500    5500       1.03%    3.93%
ckpt-6500    6500       3.91%    14.73%
ckpt-7500    7500       8.35%    28.75%

The first four span the deep weak-prior regime (where continuous-reward RL matters most); the last two are mid-prior, useful for testing whether the GOS / GRPO gap narrows as base competence improves.
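The card does not say how Pass@4 is derived from 8 generations per prompt. A minimal sketch, assuming the standard unbiased pass@k estimator (1 - C(n-c, k) / C(n, k), averaged over prompts), is shown below. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and not taken from the linked repository. Under this assumption, a single correct generation across all 256 x 8 samples gives roughly 0.05% Pass@1 and 0.20% Pass@4, which is consistent with the ckpt-2500 row.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one prompt with n samples, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-prompt estimate over all prompts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts: 256 prompts, 8 generations each, exactly one success overall.
counts = [1] + [0] * 255
print(f"{mean_pass_at_k(counts, n=8, k=1):.2%}")  # 0.05%
print(f"{mean_pass_at_k(counts, n=8, k=4):.2%}")  # 0.20%
```

With n=8 and small success counts, Pass@4 is close to 4 x Pass@1, which matches the pattern in the weak-prior rows; the ratio drops for the later checkpoints as more prompts have multiple correct generations.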

data/

  • train_random.json - 1.3M random-trajectory SFT examples
  • test.json - 256 held-out prompts (same split used by SFT eval and RL validation)

Each example is a serialized maze (symbolic grid tokens + actions) following the MaxRL Maze-17 format.
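To inspect the data files before training, a loader along the following lines may be enough. This is a sketch under assumptions: the card does not document the on-disk layout, so the helper accepts either a single JSON array or JSON Lines, and it makes no claim about the field names inside each example. `load_examples` is a hypothetical helper, not part of the linked repository.

```python
import json
from pathlib import Path


def load_examples(path):
    """Load examples from a JSON array file, falling back to JSON Lines.

    Assumption: the file layout is not documented in the card, so both
    common layouts are handled here.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat each non-empty line as one record.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Normalize a single top-level object to a one-element list.
    return data if isinstance(data, list) else [data]


if __name__ == "__main__":
    path = Path("data/test.json")
    if path.exists():
        examples = load_examples(path)
        print(len(examples))  # the card describes 256 held-out prompts
        print(type(examples[0]))
```

Printing the type and, if it is a dict, the keys of the first record is a quick way to confirm how the serialized maze and actions are stored.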

Reproduction

Code: https://github.com/stablegradients/GOS-17X17-Maze
