# Round 6: Real Generation Scaling Sweep
Round 6 redoes Round 5's scaling sweep with real generation eval; Round 5's surrogate numbers are deprecated.
## Scaling table: gap_recovered mean ± std across seeds and held-outs
| N | mean | global_ridge | topk8_global_ridge |
|---:|---:|---:|---:|
| 4 | 0.030 ± 0.065 | -0.003 ± 0.195 | -0.017 ± 0.183 |
| 8 | 0.069 ± 0.062 | 0.131 ± 0.140 | 0.125 ± 0.154 |
| 12 | 0.077 ± 0.071 | 0.126 ± 0.128 | 0.117 ± 0.109 |
| 16 | 0.077 ± 0.071 | 0.137 ± 0.126 | 0.124 ± 0.101 |
| 24 | 0.083 ± 0.072 | 0.135 ± 0.104 | 0.121 ± 0.124 |
## Figure

## Headline
`topk8_global_ridge` does not beat `global_ridge` at any N in the aggregate scaling table; at N=24 the difference (top-K minus global) is -0.014.
Best aggregate cell: N=16, method=global_ridge, gap_recovered=0.137.
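
The headline numbers can be re-derived from the aggregate means in the scaling table. The sketch below copies those means by hand (standard deviations omitted); the helper names `topk_minus_global` and `best_cell` are illustrative and not part of the sweep code.

```python
# Aggregate gap_recovered means copied from the scaling table (std omitted).
TABLE = {
    4: {"mean": 0.030, "global_ridge": -0.003, "topk8_global_ridge": -0.017},
    8: {"mean": 0.069, "global_ridge": 0.131, "topk8_global_ridge": 0.125},
    12: {"mean": 0.077, "global_ridge": 0.126, "topk8_global_ridge": 0.117},
    16: {"mean": 0.077, "global_ridge": 0.137, "topk8_global_ridge": 0.124},
    24: {"mean": 0.083, "global_ridge": 0.135, "topk8_global_ridge": 0.121},
}


def topk_minus_global(table):
    """Per-N difference topk8_global_ridge - global_ridge, rounded to 3 places."""
    return {
        n: round(row["topk8_global_ridge"] - row["global_ridge"], 3)
        for n, row in table.items()
    }


def best_cell(table):
    """Return (N, method, value) for the largest aggregate mean in the table."""
    cells = (
        (n, method, value)
        for n, row in table.items()
        for method, value in row.items()
    )
    return max(cells, key=lambda cell: cell[2])


diffs = topk_minus_global(TABLE)
print(diffs)             # negative at every N means top-K never wins
print(best_cell(TABLE))
```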
## Honest failure modes
- No surrogate was used: every reported predicted-adapter accuracy comes from `model.generate(do_sample=False, num_beams=1)` on the saved predicted LoRA adapter.
- Base-Y and oracle-Y constants are reused from `results_round4.json`; predicted adapters are newly evaluated with the same prompt/answer extraction path as the R4 main table.
- `arc_challenge` remains excluded because R4 marked it unusable (oracle-base < 3 pp), making gap_recovered noisy.
- Code-task evaluation is still the R4 cheap string/span exact-match proxy, not sandboxed unit tests.
- No budget reduction was applied; the full 195-cell grid completed.
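
To make the `arc_challenge` exclusion concrete, here is a minimal sketch of why a small oracle-base gap makes `gap_recovered` noisy. It assumes that `gap_recovered` is the fraction of the base-to-oracle accuracy gap closed by the predicted adapter; the report does not spell out the formula, so treat that definition as an assumption. The function names and the example accuracies are hypothetical.

```python
def gap_recovered(pred_acc, base_acc, oracle_acc):
    """Assumed definition: (pred - base) / (oracle - base).

    All accuracies are fractions in [0, 1].
    """
    gap = oracle_acc - base_acc
    if gap == 0:
        raise ValueError("oracle and base accuracy are equal; ratio undefined")
    return (pred_acc - base_acc) / gap


def is_usable(base_acc, oracle_acc, min_gap_pp=3.0):
    """Mirror the stated rule: a task is unusable when oracle-base < 3 pp."""
    return (oracle_acc - base_acc) * 100.0 >= min_gap_pp


# Hypothetical accuracies, for illustration only.
# With a 2 pp gap, a 1 pp change in predicted accuracy moves the ratio by 0.5.
print(is_usable(0.50, 0.52))
print(gap_recovered(0.51, 0.50, 0.52))
# With a 20 pp gap, the same 1 pp change moves the ratio by only 0.05.
print(is_usable(0.50, 0.70))
print(gap_recovered(0.51, 0.50, 0.70))
```

Under this assumed definition the denominator is the oracle-base gap, so any fixed amount of evaluation noise in the predicted accuracy is amplified as the gap shrinks.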