Cross-Model LoRA Adapter Translation — Round 3 (3B Domain Expansion)

Repo: https://huggingface.co/Samarth0710/cross-model-lora-prediction-3b

Models: X=Qwen/Qwen2.5-3B-Instruct → Y=meta-llama/Llama-3.2-3B-Instruct

LoRA: r=8, alpha=16, dropout=0, target=q_proj,v_proj, 1 epoch SFT, bs=8, lr=2e-4, bf16, max_seq_len=512.
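For reference, the adapter recipe above can be written down as a small config object. This is only a sketch: the class name `AdapterRecipe` and its field names are hypothetical and do not come from the repo, and the `scaling` property assumes the standard LoRA convention of scaling the low-rank update by alpha / r, which the report does not state explicitly.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AdapterRecipe:
    """Hypothetical container for the hyperparameters listed in the report."""
    rank: int = 8
    alpha: int = 16
    dropout: float = 0.0
    target_modules: tuple = ("q_proj", "v_proj")
    epochs: int = 1
    batch_size: int = 8
    learning_rate: float = 2e-4
    dtype: str = "bf16"
    max_seq_len: int = 512

    @property
    def scaling(self) -> float:
        # Assumption: standard LoRA scaling, i.e. the update B @ A is
        # multiplied by alpha / rank.
        return self.alpha / self.rank


recipe = AdapterRecipe()
print(asdict(recipe))
print("LoRA scaling factor:", recipe.scaling)
```

Under that assumption, r=8 and alpha=16 give a scaling factor of 2 on the adapter update.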

Experiment 1 — Main table

| Domain | Task | base_Y | mean | global_ridge | pertensor_ridge | topk8_global_ridge | topk8_pertensor_ridge | pertensor_mlp | oracle | gap_recovered |
|---|---|---|---|---|---|---|---|---|---|---|
| math | gsm_hard | 0.057 | 0.063 | 0.053 | 0.057 | 0.050 | 0.047 | 0.067 | 0.073 | 0.600 |
| math | math_algebra_medium | 0.093 | 0.100 | 0.093 | 0.100 | 0.103 | 0.103 | 0.093 | 0.097 | 3.000 |
| code | humaneval_plus | 0.079 | 0.085 | 0.067 | 0.067 | 0.067 | 0.067 | 0.073 | 0.067 | -0.500 |
| code | mbpp_plus | 0.217 | 0.207 | 0.217 | 0.210 | 0.213 | 0.203 | 0.200 | 0.220 | 0.000 |
| science | arc_challenge | 0.706 | 0.732 | 0.706 | 0.706 | 0.706 | 0.706 | 0.726 | 0.726 | 1.333 |
| science | mmlu_college_chemistry | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 | 0.375 | 0.250 | 0.375 | NA |
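The report does not define gap_recovered. The reported values look consistent with (best non-oracle score − base_Y) / (oracle − base_Y), with NA when oracle equals base_Y. The sketch below implements that inferred definition; the function name, the `eps` threshold, and the choice of "best non-oracle column" as the numerator are assumptions, not taken from the repo. Because the table entries are rounded to three decimals, recomputing from them only approximately reproduces the reported column.

```python
def gap_recovered(base, best, oracle, eps=1e-9):
    """Fraction of the base-to-oracle gap closed by the best predicted adapter.

    Returns None (reported as NA) when the oracle score equals the base
    score, since the ratio is undefined in that case.
    """
    gap = oracle - base
    if abs(gap) < eps:
        return None
    return (best - base) / gap


# (base_Y, non-oracle scores, oracle) copied from three rows of the main table.
rows = {
    "humaneval_plus": (0.079, [0.085, 0.067, 0.067, 0.067, 0.067, 0.073], 0.067),
    "arc_challenge": (0.706, [0.732, 0.706, 0.706, 0.706, 0.706, 0.726], 0.726),
    "mmlu_college_chemistry": (0.375, [0.375, 0.375, 0.375, 0.375, 0.375, 0.250], 0.375),
}

results = {}
for task, (base, scores, oracle) in rows.items():
    # Assumption: the numerator uses the best of all non-oracle columns.
    results[task] = gap_recovered(base, max(scores), oracle)
    value = results[task]
    print(task, "NA" if value is None else f"{value:.3f}")
```

A negative value, as for humaneval_plus, arises when the oracle adapter scores below base_Y, so the denominator changes sign.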

Success criteria

  • Best learned method minus mean baseline (average over held-out tasks): -0.000
  • Domain-average gap_recovered: math 1.800, code -0.250, science 1.333 (the science value matches arc_challenge alone, since mmlu_college_chemistry is NA)
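The domain averages can be reproduced from the gap_recovered column of the main table. The sketch below assumes NA entries are skipped rather than counted as zero, which is what the science value of 1.333 suggests; the helper name `domain_average` is hypothetical.

```python
from statistics import mean

# gap_recovered per task, copied from the main table (None stands for NA).
gap_by_task = {
    "math": {"gsm_hard": 0.600, "math_algebra_medium": 3.000},
    "code": {"humaneval_plus": -0.500, "mbpp_plus": 0.000},
    "science": {"arc_challenge": 1.333, "mmlu_college_chemistry": None},
}


def domain_average(task_gaps):
    """Average the defined gap_recovered values; NA entries are skipped."""
    defined = [g for g in task_gaps.values() if g is not None]
    return mean(defined) if defined else None


domain_avg = {domain: domain_average(tasks) for domain, tasks in gap_by_task.items()}
for domain, value in domain_avg.items():
    print(domain, "NA" if value is None else f"{value:.3f}")
```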

Experiment 2 — Anchor-count + Top-K scaling

[Figure: Anchor scaling]

Experiment 3 — Cross-domain transfer

[Figure: Transfer heatmap]

| Held-out domain | Best anchor pool | Top-K actual selections (top-3) |
|---|---|---|
| math | code-only | gsm_hard: humaneval, mbpp_sanitized, mbpp; math_algebra_medium: humaneval, mbpp_sanitized, mbpp |
| code | math-only | humaneval_plus: math_counting_easy, multiarith, math_algebra_easy; mbpp_plus: math_counting_easy, multiarith, math_algebra_easy |
| science | code-only | arc_challenge: humaneval, mbpp_sanitized, mbpp; mmlu_college_chemistry: mbpp, mbpp_sanitized, humaneval |

Honest failure modes / notes

  • Dataset-loading failures, if any, are listed in dataset_audit_round3.json; failed anchors were dropped as instructed, while preserving at least six anchors per domain when available.
  • Code-task evaluation is string/span matching against reference code, not sandboxed unit-test execution; numbers should be interpreted as a cheap adapter-locality proxy rather than pass@1.
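  • If an oracle adapter does not improve over base Y, the corresponding gap_recovered is unstable or meaningless and should be treated as diagnostic rather than as evidence of mapping quality. In the main table this applies to humaneval_plus (oracle 0.067 vs. base_Y 0.079) and mmlu_college_chemistry (oracle equals base_Y, so gap_recovered is NA). A very small positive gap, as in math_algebra_medium (0.097 vs. 0.093), likewise inflates the ratio.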
  • Math exact-match uses numeric extraction from generated text; formatting failures are counted as wrong.
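To make the math scoring rule concrete, here is a minimal sketch of numeric-extraction exact match. The regex, the choice to score the last number in the output, and the tolerance are all assumptions made for illustration; the report does not specify which number is extracted or how values are compared.

```python
import re

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text):
    """Return the last number found in text as a float, or None if absent."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def math_exact_match(generated, reference, tol=1e-6):
    """Score a generation as correct only if its extracted number matches.

    A generation with no extractable number (a formatting failure) is
    counted as wrong, mirroring the note above.
    """
    predicted = extract_last_number(generated)
    if predicted is None:
        return False
    return abs(predicted - float(reference)) <= tol


print(math_exact_match("The answer is 1,234.5.", "1234.5"))
print(math_exact_match("I cannot solve this problem.", "7"))
```

Scoring the last number is a common heuristic for chain-of-thought outputs, but it will misjudge generations that append extra numbers after the final answer.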