# FINAL REPORT: Locality, pool transfer, and code pass@1

## 1. Locality verdict

Overall Spearman rho between X-side LoRA cosine and single-anchor target-side gap is **0.130**. Decision: **weak locality, ridge subsumes it**.

Per-held-out rho: gsm_hard=0.285, gsm8k_test_500=0.044, mbpp_test_held=-0.290, mbpp_plus=-0.101, openbookqa_test=-0.298.

Top-3 anchors per task are in `figures/locality_topk_table.md`; the scatter is `figures/locality_scatter.png`.

## 2. Pool transfer verdict

Pool sizes: math_only=8, code_only=6, science_only=8, math_plus_code=14, all=24.

Matched-vs-mismatched mean gap: matched_domain=-0.334 (n=9), mismatched_domain=-0.224 (n=11), all_control=0.121 (n=5). Decision: **mixed; more anchors and curation both matter**.

Heatmap: `figures/pool_transfer_heatmap.png`.

## 3. Pass@1 verdict

- **mbpp_test_held**: base=0.680, oracle=0.620, mean=0.620, global_ridge=0.620 (gap=1.000), topk8=0.600. Old global_ridge string-match was 0.250.
- **mbpp_plus**: base=0.556, oracle=0.550, mean=0.537, global_ridge=0.500 (gap=10.500), topk8=0.495. Old global_ridge string-match was 0.273.

Updated table: `figures/exp_main_table_3b_final.md`.

## 4. Final paper headline

R4 established usable 3B transfer on five held-outs with real base/oracle baselines; R6 found the strongest aggregate learned mapper was global_ridge around N=16 (gap_recovered=0.137); R8 showed Top-K locality sweeps did not beat global_ridge. This final round explains why: direct X-side cosine locality is weak/non-predictive at the single-anchor level, while pool composition shows whether gains come from domain curation or simply more anchors. The workshop framing is therefore cross-model LoRA translation as a low-rank task-specialized interpolation problem, not nearest-neighbor adapter retrieval.

## 5. Honest failure modes

- All A/B accuracies use real `model.generate(do_sample=False, num_beams=1)` with the locked R6 exact-match scorer; no surrogate was used.
- Pass@1 executes generated Python with timeouts; MBPP+ uses EvalPlus-style augmented tests from `evalplus/mbppplus`, while `mbpp_test_held` uses the saved sanitized MBPP test cases.
- Duplicate R4/R5 task anchors remain separate anchors because R6's 24-anchor pool treats them as independently trained adapters.
- Degenerate oracle-base denominators would make gap_recovered unstable; the five chosen held-outs use cached R4 baselines and exclude R4's unusable ARC-Challenge cell.
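
The denominator caveat in the last bullet can be made concrete with a short sketch. It assumes gap_recovered is defined as `(method - base) / (oracle - base)`; the report does not state the formula, but this definition reproduces the `gap=1.000` reported for `mbpp_test_held`. The function name, argument order, and the `min_denominator` threshold are hypothetical and not taken from the experiment code.

```python
from typing import Optional


def gap_recovered(base: float, oracle: float, method: float,
                  min_denominator: float = 0.01) -> Optional[float]:
    """Fraction of the base-to-oracle gap recovered by a method.

    Assumed definition: (method - base) / (oracle - base).
    Returns None when |oracle - base| is below `min_denominator`
    (a hypothetical threshold), because the ratio is then dominated
    by noise and rounding in the baselines.
    """
    denominator = oracle - base
    if abs(denominator) < min_denominator:
        return None  # degenerate cell: report as unusable, not as a number
    return (method - base) / denominator


if __name__ == "__main__":
    # mbpp_test_held, rounded values from Section 3: denominator is -0.060.
    print(gap_recovered(base=0.680, oracle=0.620, method=0.620))
    # mbpp_plus, rounded values from Section 3: denominator is only -0.006,
    # so this sketch flags the cell instead of returning a ratio.
    print(gap_recovered(base=0.556, oracle=0.550, method=0.500))
```

Under this assumed definition, both code held-outs have an oracle below base, so the denominator is negative, and for `mbpp_plus` it is about 0.006 in magnitude when computed from the rounded values. A ratio over a denominator that small moves substantially with third-decimal changes in the baselines, which is one reading of why the `mbpp_plus` gap is so large.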