Round 6 — Real Generation Scaling Sweep

Round 6 redoes Round 5's scaling sweep with real generation eval; Round 5's surrogate numbers are deprecated.

Scaling table: gap_recovered, reported as mean ± std across seeds and held-out tasks

N	mean	global_ridge	topk8_global_ridge
4	0.030 ± 0.065	-0.003 ± 0.195	-0.017 ± 0.183
8	0.069 ± 0.062	0.131 ± 0.140	0.125 ± 0.154
12	0.077 ± 0.071	0.126 ± 0.128	0.117 ± 0.109
16	0.077 ± 0.071	0.137 ± 0.126	0.124 ± 0.101
24	0.083 ± 0.072	0.135 ± 0.104	0.121 ± 0.124

Headline

Top-K (topk8_global_ridge) does not beat global_ridge at any N on the aggregate scaling table; at N=24 the difference (topk8_global_ridge minus global_ridge) is 0.121 - 0.135 = -0.014. Best aggregate cell: N=16, method=global_ridge, gap_recovered=0.137.
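
The headline figures can be re-derived from the aggregate means in the scaling table. The snippet below is a minimal sketch for checking that arithmetic: the `table` dict is simply the table's mean values retyped by hand (std values omitted), and the helper names are illustrative, not taken from the project's code.

```python
# Aggregate gap_recovered means, retyped from the scaling table (std omitted).
table = {
    4:  {"mean": 0.030, "global_ridge": -0.003, "topk8_global_ridge": -0.017},
    8:  {"mean": 0.069, "global_ridge": 0.131,  "topk8_global_ridge": 0.125},
    12: {"mean": 0.077, "global_ridge": 0.126,  "topk8_global_ridge": 0.117},
    16: {"mean": 0.077, "global_ridge": 0.137,  "topk8_global_ridge": 0.124},
    24: {"mean": 0.083, "global_ridge": 0.135,  "topk8_global_ridge": 0.121},
}


def topk_minus_global(row):
    """Difference topk8_global_ridge - global_ridge, rounded to table precision."""
    return round(row["topk8_global_ridge"] - row["global_ridge"], 3)


# Per-N difference between the Top-K variant and plain global_ridge.
diffs = {n: topk_minus_global(row) for n, row in table.items()}

# Best aggregate cell over all (N, method) pairs.
best = max(
    ((n, method, value) for n, row in table.items() for method, value in row.items()),
    key=lambda cell: cell[2],
)

for n, d in diffs.items():
    print(f"N={n}: topk8 - global = {d:+.3f}")
print("best cell:", best)
```

Because this only compares aggregate means, it says nothing about whether the differences are meaningful relative to the reported std values.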

Honest failure modes

No surrogate was used: every reported predicted-adapter accuracy comes from model.generate(do_sample=False, num_beams=1) on the saved predicted LoRA adapter.
Base-Y and oracle-Y constants are reused from results_round4.json; predicted adapters are newly evaluated with the same prompt/answer extraction path as the R4 main table.
arc_challenge remains excluded because R4 marked it unusable: its oracle accuracy exceeds base accuracy by less than 3 pp, which makes gap_recovered too noisy to report.
Code-task evaluation is still the R4 cheap string/span exact-match proxy, not sandboxed unit tests.
No budget reduction was applied; the full 195-cell grid completed.
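
To illustrate why a small oracle-minus-base gap makes gap_recovered noisy, here is a hedged sketch. It assumes gap_recovered is defined as (predicted - base) / (oracle - base); that formula is inferred from the metric's name and is not stated in this report. The accuracies below are made-up illustrative values, not results from any round.

```python
def gap_recovered(pred_acc, base_acc, oracle_acc):
    """Assumed definition: fraction of the base-to-oracle gap closed by the predicted adapter."""
    denom = oracle_acc - base_acc
    if denom <= 0:
        raise ValueError("oracle accuracy must exceed base accuracy")
    return (pred_acc - base_acc) / denom


def sensitivity(base_acc, oracle_acc, acc_noise=0.01):
    """Shift in gap_recovered caused by an accuracy change of acc_noise (default 1 pp)."""
    return acc_noise / (oracle_acc - base_acc)


# Hypothetical task with a 20 pp oracle-base gap: 1 pp of noise moves the metric by about 0.05.
wide = sensitivity(base_acc=0.50, oracle_acc=0.70)

# Hypothetical task with a 3 pp gap: the same 1 pp of noise moves it by about 0.33.
narrow = sensitivity(base_acc=0.50, oracle_acc=0.53)

print(f"wide gap sensitivity:   {wide:.3f}")
print(f"narrow gap sensitivity: {narrow:.3f}")
```

Under this assumed definition, the metric's sensitivity to evaluation noise scales inversely with the oracle-minus-base gap, which is consistent with excluding tasks below a gap threshold.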