CK0607/cross-model-lora-prediction-3b

cross-model-lora-prediction-3b / ROUND6_REPORT.md

Round 6 real generation eval: ROUND6_REPORT.md

70ff653 verified about 1 month ago

	# Round 6 — Real Generation Scaling Sweep

	Round 6 redoes Round 5's scaling sweep with real generation eval; Round 5's surrogate numbers are deprecated.

	## Scaling table — gap_recovered mean ± std across seeds and held-outs

	| N | mean | global_ridge | topk8_global_ridge |
	|---:|---:|---:|---:|
	| 4 | 0.030 ± 0.065 | -0.003 ± 0.195 | -0.017 ± 0.183 |
	| 8 | 0.069 ± 0.062 | 0.131 ± 0.140 | 0.125 ± 0.154 |
	| 12 | 0.077 ± 0.071 | 0.126 ± 0.128 | 0.117 ± 0.109 |
	| 16 | 0.077 ± 0.071 | 0.137 ± 0.126 | 0.124 ± 0.101 |
	| 24 | 0.083 ± 0.072 | 0.135 ± 0.104 | 0.121 ± 0.124 |

	## Figure

	![Round 6 real generation scaling](figures/exp_scaling_3b_v2.png)

	## Headline

	`topk8_global_ridge` does not beat `global_ridge` at any N in the aggregate scaling table; at N=24 the difference (`topk8_global_ridge` minus `global_ridge`) is 0.121 - 0.135 = -0.014.
	Best aggregate cell: N=16, method=`global_ridge`, gap_recovered=0.137.
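
The two headline numbers can be rechecked directly from the aggregate means in the scaling table. The snippet below is a minimal sketch for that purpose only: the `table` dict and the helper `topk_minus_global` are illustrative names, not part of the evaluation code, and the values are the table means copied by hand (standard deviations are ignored).

```python
# Aggregate gap_recovered means copied from the scaling table (std omitted).
# `table` and `topk_minus_global` are illustrative names, not from the repo.
table = {
    4:  {"mean": 0.030, "global_ridge": -0.003, "topk8_global_ridge": -0.017},
    8:  {"mean": 0.069, "global_ridge": 0.131, "topk8_global_ridge": 0.125},
    12: {"mean": 0.077, "global_ridge": 0.126, "topk8_global_ridge": 0.117},
    16: {"mean": 0.077, "global_ridge": 0.137, "topk8_global_ridge": 0.124},
    24: {"mean": 0.083, "global_ridge": 0.135, "topk8_global_ridge": 0.121},
}


def topk_minus_global(n: int) -> float:
    """Difference topk8_global_ridge - global_ridge at a given N, rounded to 3 dp."""
    row = table[n]
    return round(row["topk8_global_ridge"] - row["global_ridge"], 3)


# Best aggregate cell as an (N, method, gap_recovered) triple.
best_cell = max(
    ((n, method, value) for n, row in table.items() for method, value in row.items()),
    key=lambda cell: cell[2],
)

if __name__ == "__main__":
    for n in table:
        print(n, topk_minus_global(n))
    print("best:", best_cell)
```

Because the standard deviations in the table are as large as or larger than these differences, the check confirms the arithmetic of the headline, not its statistical significance.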

	## Honest failure modes

	- No surrogate was used: every reported predicted-adapter accuracy comes from greedy decoding, `model.generate(do_sample=False, num_beams=1)`, on the saved predicted LoRA adapter.
	- Base-Y and oracle-Y constants are reused from `results_round4.json`; predicted adapters are newly evaluated with the same prompt/answer extraction path as the R4 main table.
	- `arc_challenge` remains excluded because R4 marked it unusable: its oracle-minus-base accuracy gap is below 3 pp, which makes gap_recovered noisy.
	- Code-task evaluation is still the R4 cheap string/span exact-match proxy, not sandboxed unit tests.
	- No budget reduction was applied; the full 195-cell grid completed.
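
To make the `arc_challenge` exclusion concrete, here is a hedged sketch of how a gap_recovered score and the 3 pp usability rule could interact. This report does not define gap_recovered; the formula below, (predicted - base) / (oracle - base), is an assumption based on the metric's name and on the base-Y and oracle-Y constants mentioned above. The function name, signature, and `MIN_ORACLE_GAP` constant are hypothetical.

```python
from typing import Optional

# 3 percentage points, expressed as an accuracy fraction (the threshold quoted above).
MIN_ORACLE_GAP = 0.03


def gap_recovered(base_acc: float, oracle_acc: float, pred_acc: float) -> Optional[float]:
    """Assumed definition: fraction of the oracle-minus-base gap that the
    predicted adapter recovers. Returns None when the gap is below the
    threshold, since a tiny denominator makes the ratio unstable."""
    gap = oracle_acc - base_acc
    if gap < MIN_ORACLE_GAP:
        return None  # task treated as unusable, as with arc_challenge
    return (pred_acc - base_acc) / gap


if __name__ == "__main__":
    # Made-up accuracies, for illustration only.
    print(gap_recovered(0.50, 0.60, 0.52))  # usable task, 10 pp gap
    print(gap_recovered(0.50, 0.52, 0.51))  # 2 pp gap, excluded
```

Under this assumed definition, a one-point change in predicted accuracy on a task with a 2 pp gap would move the score by 0.5, which is the kind of noise the exclusion is meant to avoid.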