# FINAL REPORT — Locality, pool transfer, and code pass@1

## 1. Locality verdict
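The overall Spearman rho between X-side LoRA cosine similarity and the single-anchor target-side gap is **0.130**. Decision: **weak locality, ridge subsumes it**.
Per-held-out rho: gsm_hard=0.285, gsm8k_test_500=0.044, mbpp_test_held=-0.290, mbpp_plus=-0.101, openbookqa_test=-0.298. The sign is positive on the two GSM sets and negative on the other three, so cosine similarity does not rank anchors consistently across tasks.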
Top-3 anchors per task are in `figures/locality_topk_table.md`; the scatter is `figures/locality_scatter.png`.
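
For readers who want to recompute the locality statistic, the following is a minimal sketch of a tie-aware Spearman rho in plain Python. The function names and the sample `cosines` and `gaps` lists are hypothetical and are not values from this report; the actual analysis may well have used a library routine such as `scipy.stats.spearmanr` instead.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical (cosine, gap) pairs for one held-out task; illustrative only.
cosines = [0.10, 0.35, 0.20, 0.80, 0.55]
gaps = [0.02, 0.10, 0.05, 0.30, 0.01]
rho = spearman_rho(cosines, gaps)
```

Computing rho once per held-out task and once over the pooled pairs would give the per-task and overall figures in the form reported above.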

## 2. Pool transfer verdict
Pool sizes: math_only=8, code_only=6, science_only=8, math_plus_code=14, all=24.
Matched-vs-mismatched mean gap: matched_domain=-0.334 (n=9), mismatched_domain=-0.224 (n=11), all_control=0.121 (n=5).
Decision: **mixed; more anchors and curation both matter**. Heatmap: `figures/pool_transfer_heatmap.png`.
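
The matched/mismatched split can be sketched as below. The task-to-domain mapping is an assumption (the report does not state it), and the function names are hypothetical. Under this mapping the 25 pool-by-task cells divide into 9 matched, 11 mismatched, and 5 control cells, which agrees with the n values reported above.

```python
from collections import defaultdict

# Domains covered by each pool; None marks the "all" pool as the control.
POOL_DOMAINS = {
    "math_only": {"math"},
    "code_only": {"code"},
    "science_only": {"science"},
    "math_plus_code": {"math", "code"},
    "all": None,
}

# Assumed domain of each held-out task (not stated in the report).
TASK_DOMAIN = {
    "gsm_hard": "math",
    "gsm8k_test_500": "math",
    "mbpp_test_held": "code",
    "mbpp_plus": "code",
    "openbookqa_test": "science",
}


def cell_group(pool, task):
    """Classify one (pool, task) cell as matched, mismatched, or control."""
    domains = POOL_DOMAINS[pool]
    if domains is None:
        return "all_control"
    if TASK_DOMAIN[task] in domains:
        return "matched_domain"
    return "mismatched_domain"


def group_means(cell_gaps):
    """Map {(pool, task): gap} to {group: (mean gap, n)}."""
    buckets = defaultdict(list)
    for (pool, task), gap in cell_gaps.items():
        buckets[cell_group(pool, task)].append(gap)
    return {g: (sum(v) / len(v), len(v)) for g, v in buckets.items()}


# Count how many cells fall in each group across the full 5 x 5 grid.
group_counts = defaultdict(int)
for pool in POOL_DOMAINS:
    for task in TASK_DOMAIN:
        group_counts[cell_group(pool, task)] += 1
```

Passing the per-cell gaps from the heatmap to `group_means` would yield the three group means in the same form as the line above.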

## 3. Pass@1 verdict
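- **mbpp_test_held**: base=0.680, oracle=0.620, mean=0.620, global_ridge=0.620 (gap_recovered=1.000), topk8=0.600. Old global_ridge string-match was 0.250.
- **mbpp_plus**: base=0.556, oracle=0.550, mean=0.537, global_ridge=0.500 (gap_recovered=10.500), topk8=0.495. Old global_ridge string-match was 0.273.

On both tasks the oracle scores below base, so the gap_recovered denominator is negative: 1.000 on mbpp_test_held means global_ridge only matches an oracle that underperforms base, and 10.500 on mbpp_plus reflects a near-zero denominator (0.550 vs 0.556), not a tenfold gain. No method beats base on either task.
Updated table: `figures/exp_main_table_3b_final.md`.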
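
The following sketch shows how the parenthesized gap values behave, assuming they are computed as (method - base) / (oracle - base). The function name and the `min_denominator` threshold of 0.02 are hypothetical and are not part of the locked scorer.

```python
def gap_recovered(method_acc, base_acc, oracle_acc, min_denominator=0.02):
    """Return (method - base) / (oracle - base).

    Returns None when |oracle - base| is below min_denominator, because the
    ratio is then dominated by noise in the denominator.
    """
    denominator = oracle_acc - base_acc
    if abs(denominator) < min_denominator:
        return None
    return (method_acc - base_acc) / denominator


# mbpp_test_held: global_ridge equals the oracle, so the ratio is 1.
held = gap_recovered(0.620, 0.680, 0.620)

# mbpp_plus: the oracle-minus-base denominator is only -0.006, so the guard trips.
plus = gap_recovered(0.500, 0.556, 0.550)

# With the guard disabled, the rounded inputs give a large positive ratio.
plus_unguarded = gap_recovered(0.500, 0.556, 0.550, min_denominator=0.0)
```

With the three-decimal values printed above, the unguarded mbpp_plus ratio comes out near 9.3, not 10.500. A denominator of -0.006 is small enough that rounding of the inputs could plausibly account for the difference, which is one reason to flag such cells and not report the ratio.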

## 4. Final paper headline
R4 established usable 3B transfer on five held-out tasks with real base and oracle baselines. R6 found that the strongest aggregate learned mapper was global_ridge at around N=16 (gap_recovered=0.137). R8 showed that Top-K locality sweeps did not beat global_ridge. This final round explains why: direct X-side cosine locality is weak and non-predictive at the single-anchor level (overall rho=0.130), while the pool-composition comparison shows whether gains come from domain curation or simply from more anchors. The workshop framing is therefore cross-model LoRA translation as a low-rank, task-specialized interpolation problem, not nearest-neighbor adapter retrieval.

## 5. Honest failure modes
- All A/B accuracies use real `model.generate(do_sample=False, num_beams=1)` with the locked R6 exact-match scorer; no surrogate was used.
- Pass@1 executes generated Python with timeouts; MBPP+ uses EvalPlus-style augmented tests from `evalplus/mbppplus`, while `mbpp_test_held` uses the saved sanitized MBPP test cases.
- Duplicate R4/R5 task anchors remain separate anchors because R6's 24-anchor pool treats them as independently trained adapters.
- Degenerate oracle-base denominators would make gap_recovered unstable; the five chosen held-outs use cached R4 baselines and exclude R4's unusable ARC-Challenge cell.
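
As an illustration of the execution-based pass@1 check described in the list above, here is a minimal sketch that runs each generated program together with its tests in a fresh interpreter under a timeout. The helper names, the 5-second default, and the way tests are appended are assumptions and do not describe the harness used for the reported numbers. A subprocess alone is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code, test_code, timeout_s=5.0):
    """Run candidate plus tests in a new interpreter; True only on exit code 0."""
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions count as failures.
            return False
    return result.returncode == 0


def pass_at_1(samples, timeout_s=5.0):
    """samples: list of (candidate_code, test_code), one greedy completion each."""
    if not samples:
        return 0.0
    passed = sum(passes_tests(code, tests, timeout_s) for code, tests in samples)
    return passed / len(samples)
```

With one greedy completion per problem, pass@1 reduces to the fraction of problems whose single completion passes all of its tests, which is what `pass_at_1` returns.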