Cross-Model LoRA Adapter Translation — Round 4

Repo: https://huggingface.co/CK0607/cross-model-lora-prediction-3b

Models: X=Qwen/Qwen2.5-3B-Instruct → Y=meta-llama/Llama-3.2-3B-Instruct

Diff vs Round 3

  • Kept the Round 3 3B model pair and mapping algorithms unchanged.
  • Replaced broken held-outs: math_algebra_medium → gsm8k_test_500, humaneval_plus → mbpp_test_held, mmlu_college_chemistry → openbookqa_test.
  • Retrained only the bounded Round 4 pool: 16 matched X/Y anchors plus 6 X held-out conditioning adapters and 6 Y oracle adapters.
  • Stronger recipe: LoRA r=16, alpha=32, targets=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'], epochs=3.0, train_per_task=1500, lr=0.0002, bf16, max_len=512.
  • Recomputed Top-K cosine selection from the new r=16/full-target X adapter space.
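The Top-K cosine selection in the last bullet can be sketched as follows. This is a minimal illustration, assuming each X adapter has already been flattened into a single vector; the function names and the toy vectors are hypothetical and are not taken from the repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_anchors(query, anchors, k=8):
    """Return the names of the k anchors most similar to the query, best first."""
    ranked = sorted(anchors.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Toy stand-ins for flattened X-side adapter weights (hypothetical values).
anchors = {
    "anchor_a": [1.0, 0.0, 0.0],
    "anchor_b": [0.9, 0.1, 0.0],
    "anchor_c": [0.0, 1.0, 0.0],
}
held_out = [1.0, 0.02, 0.0]

selected = top_k_anchors(held_out, anchors, k=2)
print(selected)
```

With 16 anchors and k=8, half of the pool is always selected, so rankings for different held-outs can overlap heavily.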

Experiment 1 — Main table

Rows where the oracle adapter beats base_Y by less than 3 pp (oracle - base_Y < 3 pp) are flagged as not usable and are excluded from averages.

| Domain | Task | base_Y | mean | global_ridge | pertensor_ridge | topk8_global_ridge | topk8_pertensor_ridge | pertensor_mlp | oracle | oracle_minus_base_pp | usable | gap_recovered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| math | gsm_hard | 0.063 | 0.057 | 0.060 | 0.067 | 0.067 | 0.063 | 0.073 | 0.150 | 8.667 | True | 0.115 |
| math | gsm8k_test_500 | 0.080 | 0.093 | 0.100 | 0.100 | 0.093 | 0.097 | 0.100 | 0.293 | 21.333 | True | 0.094 |
| code | mbpp_test_held | 0.230 | 0.240 | 0.250 | 0.250 | 0.250 | 0.250 | 0.240 | 0.320 | 9.000 | True | 0.222 |
| code | mbpp_plus | 0.217 | 0.213 | 0.280 | 0.270 | 0.270 | 0.267 | 0.210 | 0.450 | 23.333 | True | 0.271 |
| science | arc_challenge | 0.716 | 0.732 | 0.736 | 0.729 | 0.736 | 0.729 | 0.739 | 0.722 | 0.669 | False | 5.000 |
| science | openbookqa_test | 0.710 | 0.760 | 0.747 | 0.743 | 0.713 | 0.717 | 0.753 | 0.983 | 27.333 | True | 0.183 |
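The usable flag and the gap_recovered column can be approximated from the table with a few lines of arithmetic. The sketch below assumes that gap_recovered is (best method score - base_Y) / (oracle - base_Y), with the best score taken over all six method columns including mean. That definition is an inference from the rounded numbers, not something the report states: it roughly matches the usable rows but does not reproduce the 5.000 shown for arc_challenge, so treat it as a guess.

```python
USABLE_THRESHOLD_PP = 3.0  # from the rule stated above the table

def oracle_gap_pp(base_y, oracle):
    """Oracle improvement over the base Y model, in percentage points."""
    return 100.0 * (oracle - base_y)

def is_usable(base_y, oracle, threshold_pp=USABLE_THRESHOLD_PP):
    """A row is usable when the oracle beats base_Y by at least the threshold."""
    return oracle_gap_pp(base_y, oracle) >= threshold_pp

def gap_recovered(base_y, oracle, method_scores):
    """ASSUMED definition: fraction of the oracle gap closed by the best method."""
    return (max(method_scores) - base_y) / (oracle - base_y)

# gsm_hard row, using the rounded values from the table.
gsm_hard_methods = [0.057, 0.060, 0.067, 0.067, 0.063, 0.073]
gsm_hard_usable = is_usable(0.063, 0.150)
gsm_hard_gap = gap_recovered(0.063, 0.150, gsm_hard_methods)

# arc_challenge row: the oracle barely beats the base model.
arc_usable = is_usable(0.716, 0.722)

print(gsm_hard_usable, round(gsm_hard_gap, 3), arc_usable)
```

Because the inputs are rounded to three decimals, values computed this way can differ from the table in the last digit.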

Headline

  • Best learned method minus the mean baseline, averaged over usable held-outs: 0.0187 (about 1.9 pp on the 0-1 score scale of the main table)
  • Usable held-outs: ['gsm_hard', 'gsm8k_test_500', 'mbpp_test_held', 'mbpp_plus', 'openbookqa_test']
  • Excluded held-outs: ['arc_challenge']
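The headline figure appears to be a per-task difference averaged over the five usable rows. The sketch below assumes that "best learned method" means the per-task maximum over the five learned columns (global_ridge, pertensor_ridge, topk8_global_ridge, topk8_pertensor_ridge, pertensor_mlp), with the mean column treated as the baseline. It uses the rounded table values, so it lands at about 0.0186 rather than the reported 0.0187.

```python
# Rounded scores copied from the main table (usable rows only).
# Each entry: task -> (mean baseline, [scores of the five learned methods]).
ROWS = {
    "gsm_hard":        (0.057, [0.060, 0.067, 0.067, 0.063, 0.073]),
    "gsm8k_test_500":  (0.093, [0.100, 0.100, 0.093, 0.097, 0.100]),
    "mbpp_test_held":  (0.240, [0.250, 0.250, 0.250, 0.250, 0.240]),
    "mbpp_plus":       (0.213, [0.280, 0.270, 0.270, 0.267, 0.210]),
    "openbookqa_test": (0.760, [0.747, 0.743, 0.713, 0.717, 0.753]),
}

def headline_delta(rows):
    """Average over tasks of (best learned score - mean baseline score)."""
    deltas = [max(learned) - mean for mean, learned in rows.values()]
    return sum(deltas) / len(deltas)

headline = headline_delta(ROWS)
print(round(headline, 4))
```

Under this reading, openbookqa_test contributes a negative term (the mean baseline beats every learned method there), and mbpp_plus accounts for most of the positive average.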

Top-K selection log

| Held-out | topk8_global_ridge | topk8_pertensor_ridge |
|---|---|---|
| gsm_hard | ['math_counting_easy', 'mbpp_sanitized', 'mmlu_high_school_physics', 'humaneval', 'multiarith', 'math_algebra_easy', 'mmlu_elementary_math', 'mmlu_high_school_biology'] | ['math_counting_easy', 'mbpp_sanitized', 'mmlu_high_school_physics', 'humaneval', 'multiarith', 'math_algebra_easy', 'mmlu_elementary_math', 'mmlu_high_school_biology'] |
| gsm8k_test_500 | ['math_counting_easy', 'mbpp_sanitized', 'mmlu_high_school_physics', 'humaneval', 'multiarith', 'math_algebra_easy', 'mmlu_elementary_math', 'mmlu_high_school_biology'] | ['math_counting_easy', 'mbpp_sanitized', 'mmlu_high_school_physics', 'humaneval', 'multiarith', 'math_algebra_easy', 'mmlu_elementary_math', 'mmlu_high_school_biology'] |
| mbpp_test_held | ['mbpp_sanitized', 'math_counting_easy', 'humaneval', 'mmlu_high_school_physics', 'multiarith', 'mmlu_high_school_biology', 'mmlu_elementary_math', 'math_algebra_easy'] | ['mbpp_sanitized', 'math_counting_easy', 'humaneval', 'mmlu_high_school_physics', 'multiarith', 'mmlu_high_school_biology', 'mmlu_elementary_math', 'math_algebra_easy'] |
| mbpp_plus | ['mbpp_sanitized', 'humaneval', 'math_counting_easy', 'mmlu_high_school_physics', 'multiarith', 'mmlu_high_school_biology', 'mmlu_elementary_math', 'math_algebra_easy'] | ['mbpp_sanitized', 'humaneval', 'math_counting_easy', 'mmlu_high_school_physics', 'multiarith', 'mmlu_high_school_biology', 'mmlu_elementary_math', 'math_algebra_easy'] |
| arc_challenge | ['mmlu_high_school_physics', 'mmlu_high_school_biology', 'mmlu_elementary_math', 'math_counting_easy', 'mbpp_sanitized', 'humaneval', 'multiarith', 'math_algebra_easy'] | ['mmlu_high_school_physics', 'mmlu_high_school_biology', 'mmlu_elementary_math', 'math_counting_easy', 'mbpp_sanitized', 'humaneval', 'multiarith', 'math_algebra_easy'] |
| openbookqa_test | ['mmlu_high_school_physics', 'mmlu_high_school_biology', 'mbpp_sanitized', 'math_counting_easy', 'mmlu_elementary_math', 'humaneval', 'multiarith', 'math_algebra_easy'] | ['mmlu_high_school_physics', 'mmlu_high_school_biology', 'mbpp_sanitized', 'math_counting_easy', 'mmlu_elementary_math', 'humaneval', 'multiarith', 'math_algebra_easy'] |

Experiment 2 — Anchor-count + Top-K scaling

[Figure: Anchor scaling]

Experiment 3 — Cross-domain transfer

[Figure: Transfer heatmap]

| Held-out domain | Best anchor pool | Top-K actual selections (top-3) |
|---|---|---|
| math | science-only | {'gsm_hard': ['mmlu_high_school_physics', 'mmlu_elementary_math', 'mmlu_high_school_biology'], 'gsm8k_test_500': ['mmlu_high_school_physics', 'mmlu_elementary_math', 'mmlu_high_school_biology']} |
| code | code-only | {'mbpp_test_held': ['mbpp_sanitized', 'humaneval', 'mbpp'], 'mbpp_plus': ['mbpp_sanitized', 'humaneval', 'mbpp']} |
| science | science-only | {'arc_challenge': ['mmlu_high_school_physics', 'mmlu_high_school_biology', 'mmlu_elementary_math'], 'openbookqa_test': ['mmlu_high_school_physics', 'mmlu_high_school_biology', 'mmlu_elementary_math']} |

Honest failure modes

  • Excluded from averages: arc_challenge has oracle - base_Y = 0.67 pp, below the 3 pp usability threshold.
  • Code-task evaluation still uses cheap answer-string/span matching rather than sandboxed unit tests, so the code numbers are adapter-locality proxies, not pass@1.
  • Math uses numeric extraction and equality; generations with formatting problems or non-numeric answers are counted as wrong.
  • The Top-K and ridge methods are exactly the prior mapping family; no new mapping method was added.
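To make the math-scoring caveat concrete, here is a minimal sketch of a numeric extraction/equality check. The report does not show the actual extraction rule, so the "take the last number in the generation" heuristic, the regex, and the function names are all assumptions made for illustration.

```python
import re

# Matches integers and decimals, optionally negative, with optional thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_last_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    try:
        return float(matches[-1].replace(",", ""))
    except ValueError:
        return None

def is_correct(generation, gold, tol=1e-6):
    """Score a generation as correct only if its extracted number equals the gold answer."""
    pred = extract_last_number(generation)
    return pred is not None and abs(pred - gold) <= tol

print(is_correct("The answer is 1,234.", 1234.0))  # numeric answer with a comma
print(is_correct("The answer is twelve.", 12.0))   # spelled-out answer has no digits
```

A rule like this counts a spelled-out or otherwise unparseable answer as wrong even when the reasoning is right, which is the failure mode the math bullet describes.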