[2026-05-05 00:07:38] Dataset audit kept=22 dropped=[] domain_counts={'math': 6, 'code': 3, 'science': 7}
[2026-05-05 00:07:38] Launching 44 LoRA trainings across 8 workers
[2026-05-05 00:07:47] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / gsm8k -> /workspace/round3_out/round4/X/gsm8k
[2026-05-05 00:07:47] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / gsm8k -> /workspace/round3_out/round4/Y/gsm8k
[2026-05-05 00:07:47] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mbpp -> /workspace/round3_out/round4/X/mbpp
[2026-05-05 00:07:47] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mbpp -> /workspace/round3_out/round4/Y/mbpp
[2026-05-05 00:07:47] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / sciq -> /workspace/round3_out/round4/X/sciq
[2026-05-05 00:07:47] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / sciq -> /workspace/round3_out/round4/Y/sciq
[2026-05-05 00:07:47] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / arc_easy -> /workspace/round3_out/round4/X/arc_easy
[2026-05-05 00:07:47] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / arc_easy -> /workspace/round3_out/round4/Y/arc_easy
WARNING:accelerate.utils.other:Detected kernel version 4.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/usr/local/lib/python3.11/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
{'loss': 1.9858, 'grad_norm': 2.2264599800109863, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.06666666666666667}
{'loss': 1.1075, 'grad_norm': 0.4562191069126129, 'learning_rate': 9.252699064135758e-05, 'epoch': 1.6666666666666665}
{'train_runtime': 15.4993, 'train_samples_per_second': 23.227, 'train_steps_per_second': 2.903, 'train_loss': 0.9356466081407335, 'epoch': 3.0}
[2026-05-05 00:09:17] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mbpp
[2026-05-05 00:09:17] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / svamp -> /workspace/round3_out/round4/Y/svamp
{'loss': 1.7788, 'grad_norm': 4.538092136383057, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.06666666666666667}
{'loss': 0.8198, 'grad_norm': 0.4516417384147644, 'learning_rate': 9.252699064135758e-05, 'epoch': 1.6666666666666665}
{'train_runtime': 16.4973, 'train_samples_per_second': 21.822, 'train_steps_per_second': 2.728, 'train_loss': 0.6866574658287896, 'epoch': 3.0}
[2026-05-05 00:09:21] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mbpp
[2026-05-05 00:09:21] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / svamp -> /workspace/round3_out/round4/X/svamp
{'loss': 3.8851, 'grad_norm': 5.414454460144043, 'learning_rate': 1.4285714285714285e-05, 'epoch': 0.011363636363636364}
{'loss': 1.879, 'grad_norm': 0.9186714291572571, 'learning_rate': 0.00019904614256966512, 'epoch': 0.2840909090909091}
{'loss': 0.8067, 'grad_norm': 0.6794582009315491, 'learning_rate': 0.0001899405251566371, 'epoch': 0.5681818181818182}
{'loss': 0.6268, 'grad_norm': 0.9997307062149048, 'learning_rate': 0.0001720309024887907, 'epoch': 0.8522727272727273}
{'loss': 0.4776, 'grad_norm': 1.1762956380844116, 'learning_rate': 0.0001470703932165333, 'epoch': 1.1363636363636362}
{'loss': 0.378, 'grad_norm': 0.6738011837005615, 'learning_rate': 0.00011750230589752762, 'epoch': 1.4204545454545454}
{'loss': 0.3895, 'grad_norm': 0.9565035700798035, 'learning_rate': 8.62209709315362e-05, 'epoch': 1.7045454545454546}
{'loss': 0.3647, 'grad_norm': 0.6989470720291138, 'learning_rate': 5.6288423334906735e-05, 'epoch': 1.9886363636363638}
{'loss': 0.2745, 'grad_norm': 1.1401152610778809, 'learning_rate': 3.063466941871952e-05, 'epoch': 2.2727272727272725}
{'loss': 0.267, 'grad_norm': 0.7289299964904785, 'learning_rate': 1.1770877356504683e-05, 'epoch': 2.5568181818181817}
{'loss': 0.2583, 'grad_norm': 1.0904778242111206, 'learning_rate': 1.543566547079467e-06, 'epoch': 2.840909090909091}
{'train_runtime': 51.0409, 'train_samples_per_second': 41.143, 'train_steps_per_second': 5.172, 'train_loss': 0.5629278331091909, 'epoch': 3.0}
[2026-05-05 00:10:21] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / svamp
[2026-05-05 00:10:21] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / humaneval -> /workspace/round3_out/round4/Y/humaneval
{'loss': 3.426, 'grad_norm': 15.716981887817383, 'learning_rate': 1.4285714285714285e-05, 'epoch': 0.011363636363636364}
{'loss': 1.2237, 'grad_norm': 1.0280747413635254, 'learning_rate': 0.00019904614256966512, 'epoch': 0.2840909090909091}
{'loss': 0.4819, 'grad_norm': 1.0859547853469849, 'learning_rate': 0.0001899405251566371, 'epoch': 0.5681818181818182}
{'loss': 0.4231, 'grad_norm': 1.0115028619766235, 'learning_rate': 0.0001720309024887907, 'epoch': 0.8522727272727273}
{'loss': 0.3327, 'grad_norm': 0.7834762930870056, 'learning_rate': 0.0001470703932165333, 'epoch': 1.1363636363636362}
{'loss': 0.2646, 'grad_norm': 0.8624564409255981, 'learning_rate': 0.00011750230589752762, 'epoch': 1.4204545454545454}
{'loss': 0.2764, 'grad_norm': 0.9102126955986023, 'learning_rate': 8.62209709315362e-05, 'epoch': 1.7045454545454546}
{'loss': 0.2753, 'grad_norm': 0.6169250011444092, 'learning_rate': 5.6288423334906735e-05, 'epoch': 1.9886363636363638}
{'loss': 0.1749, 'grad_norm': 1.0368030071258545, 'learning_rate': 3.063466941871952e-05, 'epoch': 2.2727272727272725}
{'loss': 0.1769, 'grad_norm': 0.8564444184303284, 'learning_rate': 1.1770877356504683e-05, 'epoch': 2.5568181818181817}
{'loss': 0.1632, 'grad_norm': 0.7322615385055542, 'learning_rate': 1.543566547079467e-06, 'epoch': 2.840909090909091}
{'train_runtime': 65.3586, 'train_samples_per_second': 32.13, 'train_steps_per_second': 4.039, 'train_loss': 0.3761867384115855, 'epoch': 3.0}
[2026-05-05 00:10:39] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / svamp
[2026-05-05 00:10:39] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / humaneval -> /workspace/round3_out/round4/X/humaneval
{'loss': 4.2319, 'grad_norm': 5.922606945037842, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.4312, 'grad_norm': 0.7614614367485046, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.9135, 'grad_norm': 0.5550001263618469, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.8105, 'grad_norm': 0.547438383102417, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.7767, 'grad_norm': 0.514215350151062, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.7888, 'grad_norm': 0.5550564527511597, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.7401, 'grad_norm': 0.4337121844291687, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.7774, 'grad_norm': 0.39039838314056396, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.7268, 'grad_norm': 0.40669241547584534, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.6599, 'grad_norm': 0.7467406392097473, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.6334, 'grad_norm': 0.6880893707275391, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.6679, 'grad_norm': 0.6737778186798096, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.659, 'grad_norm': 0.6910892128944397, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.646, 'grad_norm': 0.657259464263916, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.6569, 'grad_norm': 0.6031177043914795, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.6278, 'grad_norm': 0.7731090784072876, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.4948, 'grad_norm': 0.8571438193321228, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.4625, 'grad_norm': 1.0014036893844604, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.4529, 'grad_norm': 0.9621121287345886, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.4658, 'grad_norm': 0.9596170783042908, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.4579, 'grad_norm': 1.0987216234207153, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.4419, 'grad_norm': 1.0002917051315308, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.4591, 'grad_norm': 0.984480619430542, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 107.9352, 'train_samples_per_second': 41.692, 'train_steps_per_second': 5.225, 'train_loss': 0.7122570115623744, 'epoch': 3.0}
[2026-05-05 00:10:50] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / sciq
[2026-05-05 00:10:50] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / multiarith -> /workspace/round3_out/round4/Y/multiarith
{'loss': 3.552, 'grad_norm': 4.550617218017578, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.2877, 'grad_norm': 0.7822368741035461, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.975, 'grad_norm': 0.5886964797973633, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.8684, 'grad_norm': 0.4764474630355835, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.8958, 'grad_norm': 0.5718143582344055, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.89, 'grad_norm': 0.5914545059204102, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.8775, 'grad_norm': 0.43759632110595703, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.842, 'grad_norm': 0.4401934742927551, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.8443, 'grad_norm': 0.5626454949378967, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.7373, 'grad_norm': 0.7259774208068848, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.733, 'grad_norm': 0.6643127202987671, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.7389, 'grad_norm': 0.7342930436134338, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.7102, 'grad_norm': 0.687857985496521, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.7443, 'grad_norm': 0.6613479852676392, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.7265, 'grad_norm': 0.6742613911628723, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.7238, 'grad_norm': 0.6487808227539062, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.5582, 'grad_norm': 0.9416304230690002, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.5317, 'grad_norm': 0.9671899080276489, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.523, 'grad_norm': 1.0433975458145142, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.5262, 'grad_norm': 1.0981394052505493, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.5262, 'grad_norm': 1.204676866531372, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.5139, 'grad_norm': 0.9614864587783813, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.5281, 'grad_norm': 1.0337063074111938, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 108.9654, 'train_samples_per_second': 41.298, 'train_steps_per_second': 5.176, 'train_loss': 0.7818015175508269, 'epoch': 3.0}
[2026-05-05 00:10:53] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / arc_easy
[2026-05-05 00:10:53] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mmlu_high_school_biology -> /workspace/round3_out/round4/Y/mmlu_high_school_biology
{'loss': 3.466, 'grad_norm': 4.469822883605957, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.1909, 'grad_norm': 0.706962525844574, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 1.0785, 'grad_norm': 0.5822852849960327, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 1.0387, 'grad_norm': 0.5392487049102783, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 1.0559, 'grad_norm': 0.5173244476318359, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 1.0412, 'grad_norm': 0.4650200307369232, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 1.0279, 'grad_norm': 0.4582115709781647, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.9937, 'grad_norm': 0.44744694232940674, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.9737, 'grad_norm': 0.5003761649131775, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.8887, 'grad_norm': 0.6041138768196106, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.9008, 'grad_norm': 0.6111604571342468, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.9318, 'grad_norm': 0.676913857460022, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.8907, 'grad_norm': 0.6538789868354797, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.8917, 'grad_norm': 0.6129759550094604, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.872, 'grad_norm': 0.607018768787384, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.8776, 'grad_norm': 0.6016894578933716, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.7046, 'grad_norm': 0.8797852396965027, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.6808, 'grad_norm': 0.8681241869926453, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.7057, 'grad_norm': 0.9554598331451416, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.6908, 'grad_norm': 1.1023428440093994, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.6968, 'grad_norm': 1.027468204498291, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.6945, 'grad_norm': 1.0713446140289307, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.6944, 'grad_norm': 1.056596279144287, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 110.8303, 'train_samples_per_second': 40.603, 'train_steps_per_second': 5.089, 'train_loss': 0.9285172393135991, 'epoch': 3.0}
[2026-05-05 00:10:55] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / gsm8k
[2026-05-05 00:10:55] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / openbookqa -> /workspace/round3_out/round4/Y/openbookqa
{'loss': 1.9377, 'grad_norm': 2.301626682281494, 'learning_rate': 5e-05, 'epoch': 0.047619047619047616}
{'loss': 1.2316, 'grad_norm': 0.4350847005844116, 'learning_rate': 0.00014373073204588556, 'epoch': 1.1904761904761905}
{'loss': 0.8024, 'grad_norm': 0.7615563869476318, 'learning_rate': 2.301660165700936e-05, 'epoch': 2.380952380952381}
{'train_runtime': 22.0416, 'train_samples_per_second': 22.321, 'train_steps_per_second': 2.858, 'train_loss': 0.9705733874487499, 'epoch': 3.0}
[2026-05-05 00:10:56] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / humaneval
[2026-05-05 00:10:56] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / math_algebra_easy -> /workspace/round3_out/round4/Y/math_algebra_easy
{'loss': 1.6076, 'grad_norm': 5.1808762550354, 'learning_rate': 5e-05, 'epoch': 0.047619047619047616}
{'loss': 0.826, 'grad_norm': 0.43199753761291504, 'learning_rate': 0.00014373073204588556, 'epoch': 1.1904761904761905}
{'loss': 0.5047, 'grad_norm': 0.3374844491481781, 'learning_rate': 2.301660165700936e-05, 'epoch': 2.380952380952381}
{'train_runtime': 23.4352, 'train_samples_per_second': 20.994, 'train_steps_per_second': 2.688, 'train_loss': 0.638380039305914, 'epoch': 3.0}
[2026-05-05 00:11:15] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / humaneval
[2026-05-05 00:11:15] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / math_algebra_easy -> /workspace/round3_out/round4/X/math_algebra_easy
{'loss': 2.4662, 'grad_norm': 10.978902816772461, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 0.9971, 'grad_norm': 0.8275803327560425, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.1533, 'grad_norm': 0.7566442489624023, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.1406, 'grad_norm': 0.37338656187057495, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.1397, 'grad_norm': 0.6687091588973999, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.1419, 'grad_norm': 0.3830748200416565, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.1383, 'grad_norm': 0.5664077401161194, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.1405, 'grad_norm': 0.589083731174469, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.1318, 'grad_norm': 0.37843361496925354, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.1188, 'grad_norm': 0.37342774868011475, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.1135, 'grad_norm': 0.2037201225757599, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.1192, 'grad_norm': 0.16403494775295258, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.1153, 'grad_norm': 0.18991553783416748, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.1188, 'grad_norm': 0.5353216528892517, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.1137, 'grad_norm': 0.2814735174179077, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.1133, 'grad_norm': 0.3128826916217804, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.098, 'grad_norm': 0.23966841399669647, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.098, 'grad_norm': 0.19975632429122925, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.0967, 'grad_norm': 0.24240602552890778, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.0986, 'grad_norm': 0.19192545115947723, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.0957, 'grad_norm': 0.1926991194486618, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.0968, 'grad_norm': 0.6532518267631531, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.0991, 'grad_norm': 0.260436475276947, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 145.2311, 'train_samples_per_second': 30.985, 'train_steps_per_second': 3.883, 'train_loss': 0.15920729417327448, 'epoch': 3.0}
[2026-05-05 00:11:30] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / gsm8k
[2026-05-05 00:11:30] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / openbookqa -> /workspace/round3_out/round4/X/openbookqa
{'loss': 3.5661, 'grad_norm': 12.177047729492188, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 1.8305, 'grad_norm': 0.5522000789642334, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.8422, 'grad_norm': 0.3776131868362427, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.8078, 'grad_norm': 0.33335211873054504, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.839, 'grad_norm': 0.39055490493774414, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.8351, 'grad_norm': 0.5157647728919983, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.826, 'grad_norm': 0.3225862383842468, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.8019, 'grad_norm': 0.3225404918193817, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.8003, 'grad_norm': 0.3728056848049164, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.6933, 'grad_norm': 0.5289945602416992, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.6872, 'grad_norm': 0.6276748776435852, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.7, 'grad_norm': 0.6584810614585876, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.6711, 'grad_norm': 0.5923119187355042, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.7016, 'grad_norm': 0.6503795981407166, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.6798, 'grad_norm': 0.6151637434959412, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.6883, 'grad_norm': 0.6175880432128906, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.5099, 'grad_norm': 0.7328499555587769, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.4856, 'grad_norm': 1.0046825408935547, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.4837, 'grad_norm': 0.949108362197876, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.4851, 'grad_norm': 1.040684700012207, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.4774, 'grad_norm': 1.0464662313461304, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.4777, 'grad_norm': 1.0493898391723633, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.4864, 'grad_norm': 1.0567758083343506, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 145.8834, 'train_samples_per_second': 30.847, 'train_steps_per_second': 3.866, 'train_loss': 0.7155104930519213, 'epoch': 3.0}
[2026-05-05 00:11:30] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / arc_easy
[2026-05-05 00:11:30] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mmlu_high_school_biology -> /workspace/round3_out/round4/X/mmlu_high_school_biology
{'loss': 4.3696, 'grad_norm': 16.266511917114258, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 1.963, 'grad_norm': 0.8525881767272949, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.7989, 'grad_norm': 0.5012720823287964, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.7753, 'grad_norm': 0.39920374751091003, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.75, 'grad_norm': 0.3587755560874939, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.7597, 'grad_norm': 0.34211042523384094, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.717, 'grad_norm': 0.34741517901420593, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.7466, 'grad_norm': 0.34807413816452026, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.7038, 'grad_norm': 0.33415141701698303, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.6372, 'grad_norm': 0.5864014029502869, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.6112, 'grad_norm': 0.5808629989624023, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.642, 'grad_norm': 0.6848379969596863, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.6379, 'grad_norm': 0.66339510679245, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.6206, 'grad_norm': 0.6259959936141968, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.6327, 'grad_norm': 0.7516557574272156, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.6077, 'grad_norm': 0.7318317890167236, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.482, 'grad_norm': 0.8025975823402405, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.4488, 'grad_norm': 1.0984547138214111, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.4461, 'grad_norm': 1.1453553438186646, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.4491, 'grad_norm': 0.9386950731277466, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.4392, 'grad_norm': 1.0696130990982056, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.4287, 'grad_norm': 0.9677240252494812, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.442, 'grad_norm': 1.0220279693603516, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 146.3568, 'train_samples_per_second': 30.747, 'train_steps_per_second': 3.854, 'train_loss': 0.6682407915169466, 'epoch': 3.0}
[2026-05-05 00:11:31] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / sciq
[2026-05-05 00:11:31] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / multiarith -> /workspace/round3_out/round4/X/multiarith
{'loss': 3.4117, 'grad_norm': 3.6093618869781494, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.02564102564102564}
{'loss': 1.695, 'grad_norm': 0.6251434087753296, 'learning_rate': 0.00018588632672204264, 'epoch': 0.6410256410256411}
{'loss': 1.0473, 'grad_norm': 0.6122713685035706, 'learning_rate': 0.00013197639245712454, 'epoch': 1.282051282051282}
{'loss': 0.9305, 'grad_norm': 0.8628207445144653, 'learning_rate': 6.271435222196916e-05, 'epoch': 1.9230769230769231}
{'loss': 0.798, 'grad_norm': 0.845122754573822, 'learning_rate': 1.1353431277390126e-05, 'epoch': 2.564102564102564}
{'train_runtime': 24.1792, 'train_samples_per_second': 38.463, 'train_steps_per_second': 4.839, 'train_loss': 1.077816262204423, 'epoch': 3.0}
[2026-05-05 00:11:32] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mmlu_high_school_biology
[2026-05-05 00:11:32] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mbpp_sanitized -> /workspace/round3_out/round4/Y/mbpp_sanitized
{'loss': 3.8591, 'grad_norm': 5.206538200378418, 'learning_rate': 2.5e-05, 'epoch': 0.018867924528301886}
{'loss': 1.6708, 'grad_norm': 0.8687698841094971, 'learning_rate': 0.00019381012910506146, 'epoch': 0.4716981132075472}
{'loss': 0.594, 'grad_norm': 1.3481507301330566, 'learning_rate': 0.00016419017501926656, 'epoch': 0.9433962264150944}
{'loss': 0.3535, 'grad_norm': 0.8465226888656616, 'learning_rate': 0.00011759242861991855, 'epoch': 1.4150943396226414}
{'loss': 0.2719, 'grad_norm': 0.774315357208252, 'learning_rate': 6.63416243451194e-05, 'epoch': 1.8867924528301887}
{'loss': 0.2201, 'grad_norm': 0.5559712052345276, 'learning_rate': 2.399319354583418e-05, 'epoch': 2.358490566037736}
{'loss': 0.1935, 'grad_norm': 0.6623753309249878, 'learning_rate': 1.7479603777742938e-06, 'epoch': 2.830188679245283}
{'train_runtime': 31.8636, 'train_samples_per_second': 39.544, 'train_steps_per_second': 4.99, 'train_loss': 0.544288547533863, 'epoch': 3.0}
[2026-05-05 00:11:38] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / multiarith
[2026-05-05 00:11:38] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mmlu_high_school_physics -> /workspace/round3_out/round4/Y/mmlu_high_school_physics
{'loss': 3.6115, 'grad_norm': 5.521681785583496, 'learning_rate': 2e-05, 'epoch': 0.015384615384615385}
{'loss': 1.5557, 'grad_norm': 0.7814257144927979, 'learning_rate': 0.0001967732946933499, 'epoch': 0.38461538461538464}
{'loss': 0.619, 'grad_norm': 0.512479305267334, 'learning_rate': 0.00017780357543184397, 'epoch': 0.7692307692307693}
{'loss': 0.6114, 'grad_norm': 0.4772077798843384, 'learning_rate': 0.00014502037448176734, 'epoch': 1.1538461538461537}
{'loss': 0.5125, 'grad_norm': 0.5385656952857971, 'learning_rate': 0.00010424412031961484, 'epoch': 1.5384615384615383}
{'loss': 0.4892, 'grad_norm': 0.6602227091789246, 'learning_rate': 6.271435222196916e-05, 'epoch': 1.9230769230769231}
{'loss': 0.3967, 'grad_norm': 0.8214908242225647, 'learning_rate': 2.7804390604547557e-05, 'epoch': 2.3076923076923075}
{'loss': 0.3458, 'grad_norm': 0.8156642317771912, 'learning_rate': 5.71225545389158e-06, 'epoch': 2.6923076923076925}
{'train_runtime': 45.2201, 'train_samples_per_second': 34.365, 'train_steps_per_second': 4.312, 'train_loss': 0.6290410689818553, 'epoch': 3.0}
[2026-05-05 00:11:55] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / math_algebra_easy
[2026-05-05 00:11:55] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / gsm8k_test_500 -> /workspace/round3_out/round4/Y/gsm8k_test_500
{'loss': 1.9858, 'grad_norm': 2.212726354598999, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.06666666666666667}
{'loss': 1.1076, 'grad_norm': 0.45463812351226807, 'learning_rate': 9.252699064135758e-05, 'epoch': 1.6666666666666665}
{'train_runtime': 14.9434, 'train_samples_per_second': 24.091, 'train_steps_per_second': 3.011, 'train_loss': 0.9353794945610894, 'epoch': 3.0}
[2026-05-05 00:12:03] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mbpp_sanitized
[2026-05-05 00:12:03] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / medmcqa_easy -> /workspace/round3_out/round4/Y/medmcqa_easy
{'loss': 2.993, 'grad_norm': 3.596569538116455, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.05263157894736842}
{'loss': 1.2964, 'grad_norm': 0.5417295694351196, 'learning_rate': 0.00012868032327110904, 'epoch': 1.3157894736842106}
{'loss': 0.765, 'grad_norm': 0.590907871723175, 'learning_rate': 8.178389311972612e-06, 'epoch': 2.6315789473684212}
{'train_runtime': 13.1368, 'train_samples_per_second': 34.483, 'train_steps_per_second': 4.339, 'train_loss': 1.0166657891189843, 'epoch': 3.0}
[2026-05-05 00:12:06] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mmlu_high_school_physics
[2026-05-05 00:12:06] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / aqua_rat -> /workspace/round3_out/round4/Y/aqua_rat
{'loss': 3.3715, 'grad_norm': 10.471955299377441, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.02564102564102564}
{'loss': 1.3863, 'grad_norm': 0.42306047677993774, 'learning_rate': 0.00018588632672204264, 'epoch': 0.6410256410256411}
{'loss': 0.9425, 'grad_norm': 0.44168543815612793, 'learning_rate': 0.00013197639245712454, 'epoch': 1.282051282051282}
{'loss': 0.8507, 'grad_norm': 0.6152482032775879, 'learning_rate': 6.271435222196916e-05, 'epoch': 1.9230769230769231}
{'loss': 0.7261, 'grad_norm': 0.725269615650177, 'learning_rate': 1.1353431277390126e-05, 'epoch': 2.564102564102564}
{'train_runtime': 32.4536, 'train_samples_per_second': 28.656, 'train_steps_per_second': 3.605, 'train_loss': 0.9477245074052078, 'epoch': 3.0}
[2026-05-05 00:12:18] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mmlu_high_school_biology
[2026-05-05 00:12:18] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mbpp_sanitized -> /workspace/round3_out/round4/X/mbpp_sanitized
{'loss': 3.1268, 'grad_norm': 15.000044822692871, 'learning_rate': 2e-05, 'epoch': 0.015384615384615385}
{'loss': 0.7974, 'grad_norm': 0.8673375248908997, 'learning_rate': 0.0001967732946933499, 'epoch': 0.38461538461538464}
{'loss': 0.1914, 'grad_norm': 0.4226300120353699, 'learning_rate': 0.00017780357543184397, 'epoch': 0.7692307692307693}
{'loss': 0.1572, 'grad_norm': 0.5657345056533813, 'learning_rate': 0.00014502037448176734, 'epoch': 1.1538461538461537}
{'loss': 0.1358, 'grad_norm': 0.37566131353378296, 'learning_rate': 0.00010424412031961484, 'epoch': 1.5384615384615383}
{'loss': 0.1323, 'grad_norm': 0.4015530049800873, 'learning_rate': 6.271435222196916e-05, 'epoch': 1.9230769230769231}
{'loss': 0.1137, 'grad_norm': 0.3176952600479126, 'learning_rate': 2.7804390604547557e-05, 'epoch': 2.3076923076923075}
{'loss': 0.1058, 'grad_norm': 0.31076669692993164, 'learning_rate': 5.71225545389158e-06, 'epoch': 2.6923076923076925}
{'train_runtime': 57.3587, 'train_samples_per_second': 27.093, 'train_steps_per_second': 3.4, 'train_loss': 0.23199066993517753, 'epoch': 3.0}
[2026-05-05 00:12:25] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / math_algebra_easy
[2026-05-05 00:12:25] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / gsm8k_test_500 -> /workspace/round3_out/round4/X/gsm8k_test_500
{'loss': 3.1334, 'grad_norm': 15.008722305297852, 'learning_rate': 2.5e-05, 'epoch': 0.018867924528301886}
{'loss': 0.864, 'grad_norm': 0.9282662272453308, 'learning_rate': 0.00019381012910506146, 'epoch': 0.4716981132075472}
{'loss': 0.2459, 'grad_norm': 0.6726178526878357, 'learning_rate': 0.00016419017501926656, 'epoch': 0.9433962264150944}
{'loss': 0.1691, 'grad_norm': 0.5856167674064636, 'learning_rate': 0.00011759242861991855, 'epoch': 1.4150943396226414}
{'loss': 0.157, 'grad_norm': 0.6802059412002563, 'learning_rate': 6.63416243451194e-05, 'epoch': 1.8867924528301887}
{'loss': 0.1317, 'grad_norm': 0.45471635460853577, 'learning_rate': 2.399319354583418e-05, 'epoch': 2.358490566037736}
{'loss': 0.1085, 'grad_norm': 0.5222012996673584, 'learning_rate': 1.7479603777742938e-06, 'epoch': 2.830188679245283}
{'train_runtime': 43.7541, 'train_samples_per_second': 28.797, 'train_steps_per_second': 3.634, 'train_loss': 0.283983346801134, 'epoch': 3.0}
[2026-05-05 00:12:31] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / multiarith
[2026-05-05 00:12:31] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mmlu_high_school_physics -> /workspace/round3_out/round4/X/mmlu_high_school_physics
{'loss': 1.7788, 'grad_norm': 4.910188674926758, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.06666666666666667}
{'loss': 0.8164, 'grad_norm': 0.4361398220062256, 'learning_rate': 9.252699064135758e-05, 'epoch': 1.6666666666666665}
{'train_runtime': 16.3656, 'train_samples_per_second': 21.997, 'train_steps_per_second': 2.75, 'train_loss': 0.6839464664459228, 'epoch': 3.0}
[2026-05-05 00:12:47] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mbpp_sanitized
[2026-05-05 00:12:47] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / medmcqa_easy -> /workspace/round3_out/round4/X/medmcqa_easy
{'loss': 3.3547, 'grad_norm': 4.238654613494873, 'learning_rate': 2e-05, 'epoch': 0.015873015873015872}
{'loss': 1.8066, 'grad_norm': 0.6965137720108032, 'learning_rate': 0.00019655462532757676, 'epoch': 0.3968253968253968}
{'loss': 1.1036, 'grad_norm': 0.5499321222305298, 'learning_rate': 0.0001763531637669949, 'epoch': 0.7936507936507936}
{'loss': 1.0027, 'grad_norm': 0.5470696687698364, 'learning_rate': 0.00014168658260281945, 'epoch': 1.1904761904761905}
{'loss': 0.9306, 'grad_norm': 0.6922482848167419, 'learning_rate': 9.912247141546079e-05, 'epoch': 1.5873015873015874}
{'loss': 0.9188, 'grad_norm': 0.7560460567474365, 'learning_rate': 5.672460816472556e-05, 'epoch': 1.9841269841269842}
{'loss': 0.7756, 'grad_norm': 0.879742443561554, 'learning_rate': 2.2525275111113807e-05, 'epoch': 2.380952380952381}
{'loss': 0.7524, 'grad_norm': 0.9037792086601257, 'learning_rate': 3.003541602600157e-06, 'epoch': 2.7777777777777777}
{'train_runtime': 40.544, 'train_samples_per_second': 36.997, 'train_steps_per_second': 4.662, 'train_loss': 1.0285368962262673, 'epoch': 3.0}
[2026-05-05 00:12:49] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / gsm8k_test_500
[2026-05-05 00:12:49] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / openbookqa_test -> /workspace/round3_out/round4/Y/openbookqa_test
{'loss': 2.8817, 'grad_norm': 10.458314895629883, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.05263157894736842}
{'loss': 1.0022, 'grad_norm': 0.32482507824897766, 'learning_rate': 0.00012868032327110904, 'epoch': 1.3157894736842106}
{'loss': 0.6225, 'grad_norm': 0.4488460421562195, 'learning_rate': 8.178389311972612e-06, 'epoch': 2.6315789473684212}
{'train_runtime': 15.7695, 'train_samples_per_second': 28.726, 'train_steps_per_second': 3.615, 'train_loss': 0.8166922100803309, 'epoch': 3.0}
[2026-05-05 00:13:00] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mmlu_high_school_physics
[2026-05-05 00:13:00] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / aqua_rat -> /workspace/round3_out/round4/X/aqua_rat
{'loss': 4.272, 'grad_norm': 5.605335235595703, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.679, 'grad_norm': 0.9552299380302429, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 1.1207, 'grad_norm': 0.5910282731056213, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 1.0485, 'grad_norm': 0.5686326622962952, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 1.0091, 'grad_norm': 0.5279734134674072, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 1.0313, 'grad_norm': 0.4510713815689087, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 1.0312, 'grad_norm': 0.6114181876182556, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 1.0182, 'grad_norm': 0.4982607960700989, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 1.0032, 'grad_norm': 0.5455285310745239, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.8999, 'grad_norm': 0.7275230288505554, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.8708, 'grad_norm': 0.8493562340736389, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.8976, 'grad_norm': 0.6929630637168884, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.8924, 'grad_norm': 0.628284215927124, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.8896, 'grad_norm': 0.6749979853630066, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.8933, 'grad_norm': 0.8211554884910583, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.8763, 'grad_norm': 0.8183001279830933, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.7017, 'grad_norm': 1.1786373853683472, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.6636, 'grad_norm': 1.1581196784973145, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.6781, 'grad_norm': 1.2269527912139893, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.6588, 'grad_norm': 1.1912201642990112, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.6689, 'grad_norm': 1.0328978300094604, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.6446, 'grad_norm': 1.1028183698654175, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.6392, 'grad_norm': 1.2358579635620117, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 117.7817, 'train_samples_per_second': 38.206, 'train_steps_per_second': 4.789, 'train_loss': 0.9416530859385822, 'epoch': 3.0}
[2026-05-05 00:13:08] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / openbookqa
[2026-05-05 00:13:08] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / math_counting_easy -> /workspace/round3_out/round4/Y/math_counting_easy
{'loss': 3.4685, 'grad_norm': 12.454188346862793, 'learning_rate': 2e-05, 'epoch': 0.015873015873015872}
{'loss': 1.5572, 'grad_norm': 0.42920956015586853, 'learning_rate': 0.00019655462532757676, 'epoch': 0.3968253968253968}
{'loss': 0.9914, 'grad_norm': 0.4040983319282532, 'learning_rate': 0.0001763531637669949, 'epoch': 0.7936507936507936}
{'loss': 0.9252, 'grad_norm': 0.4234474301338196, 'learning_rate': 0.00014168658260281945, 'epoch': 1.1904761904761905}
{'loss': 0.8525, 'grad_norm': 0.5223008394241333, 'learning_rate': 9.912247141546079e-05, 'epoch': 1.5873015873015874}
{'loss': 0.8353, 'grad_norm': 0.6573077440261841, 'learning_rate': 5.672460816472556e-05, 'epoch': 1.9841269841269842}
{'loss': 0.689, 'grad_norm': 0.714469850063324, 'learning_rate': 2.2525275111113807e-05, 'epoch': 2.380952380952381}
{'loss': 0.6645, 'grad_norm': 0.799299955368042, 'learning_rate': 3.003541602600157e-06, 'epoch': 2.7777777777777777}
{'train_runtime': 50.4965, 'train_samples_per_second': 29.705, 'train_steps_per_second': 3.743, 'train_loss': 0.9211538443489681, 'epoch': 3.0}
[2026-05-05 00:13:28] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / gsm8k_test_500
[2026-05-05 00:13:28] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / openbookqa_test -> /workspace/round3_out/round4/X/openbookqa_test
{'loss': 2.948, 'grad_norm': 4.251391887664795, 'learning_rate': 5e-05, 'epoch': 0.045454545454545456}
{'loss': 1.327, 'grad_norm': 0.9287701845169067, 'learning_rate': 0.00014853019625310813, 'epoch': 1.1363636363636362}
{'loss': 0.5546, 'grad_norm': 0.6912827491760254, 'learning_rate': 3.110330809243134e-05, 'epoch': 2.2727272727272725}
{'train_runtime': 15.3106, 'train_samples_per_second': 33.114, 'train_steps_per_second': 4.311, 'train_loss': 0.8481901125474409, 'epoch': 3.0}
[2026-05-05 00:13:36] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / math_counting_easy
[2026-05-05 00:13:36] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mmlu_elementary_math -> /workspace/round3_out/round4/Y/mmlu_elementary_math
{'loss': 4.3941, 'grad_norm': 5.537446975708008, 'learning_rate': 2e-05, 'epoch': 0.015873015873015872}
{'loss': 2.0433, 'grad_norm': 0.7254217267036438, 'learning_rate': 0.00019655462532757676, 'epoch': 0.3968253968253968}
{'loss': 1.0374, 'grad_norm': 0.5049721002578735, 'learning_rate': 0.0001763531637669949, 'epoch': 0.7936507936507936}
{'loss': 0.967, 'grad_norm': 0.5710312724113464, 'learning_rate': 0.00014168658260281945, 'epoch': 1.1904761904761905}
{'loss': 0.8864, 'grad_norm': 0.7858477830886841, 'learning_rate': 9.912247141546079e-05, 'epoch': 1.5873015873015874}
{'loss': 0.8698, 'grad_norm': 0.7653865218162537, 'learning_rate': 5.672460816472556e-05, 'epoch': 1.9841269841269842}
{'loss': 0.7312, 'grad_norm': 1.08513343334198, 'learning_rate': 2.2525275111113807e-05, 'epoch': 2.380952380952381}
{'loss': 0.6831, 'grad_norm': 1.0602993965148926, 'learning_rate': 3.003541602600157e-06, 'epoch': 2.7777777777777777}
{'train_runtime': 38.9841, 'train_samples_per_second': 38.477, 'train_steps_per_second': 4.848, 'train_loss': 1.0193275345696344, 'epoch': 3.0}
[2026-05-05 00:13:41] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / openbookqa_test
{'loss': 3.9832, 'grad_norm': 5.3925251960754395, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.2854, 'grad_norm': 1.0210996866226196, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.8873, 'grad_norm': 0.7830055952072144, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.8513, 'grad_norm': 0.5527425408363342, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.8347, 'grad_norm': 0.6603056192398071, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.831, 'grad_norm': 0.6267145276069641, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.8566, 'grad_norm': 0.6092323064804077, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.8189, 'grad_norm': 0.6118431687355042, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.736, 'grad_norm': 0.4591735899448395, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.6933, 'grad_norm': 0.6498691439628601, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.6717, 'grad_norm': 0.766682505607605, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.6361, 'grad_norm': 0.7608623504638672, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.6798, 'grad_norm': 0.7739782333374023, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.6539, 'grad_norm': 0.7546813488006592, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.6486, 'grad_norm': 0.769869327545166, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.6418, 'grad_norm': 0.7604761123657227, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.4578, 'grad_norm': 0.9888116717338562, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.4306, 'grad_norm': 0.8357949256896973, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.4632, 'grad_norm': 1.0413919687271118, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.4525, 'grad_norm': 1.015708565711975, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.4444, 'grad_norm': 0.8903518915176392, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.4328, 'grad_norm': 0.8709405660629272, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.4669, 'grad_norm': 0.863745391368866, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 113.6116, 'train_samples_per_second': 39.609, 'train_steps_per_second': 4.964, 'train_loss': 0.7181437522807019, 'epoch': 3.0}
[2026-05-05 00:14:12] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / medmcqa_easy
[2026-05-05 00:14:12] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mbpp_plus -> /workspace/round3_out/round4/Y/mbpp_plus
{'loss': 3.2218, 'grad_norm': 4.107120037078857, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.0251, 'grad_norm': 0.6516982913017273, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.9255, 'grad_norm': 0.5298068523406982, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.8386, 'grad_norm': 0.6183074712753296, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.8361, 'grad_norm': 0.4674525558948517, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.8165, 'grad_norm': 0.4218616485595703, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.7883, 'grad_norm': 0.4713094234466553, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.8139, 'grad_norm': 0.45018500089645386, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.7645, 'grad_norm': 0.5069326162338257, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.6803, 'grad_norm': 0.6308664083480835, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.6729, 'grad_norm': 0.6938367486000061, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.6928, 'grad_norm': 0.6024667620658875, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.6724, 'grad_norm': 0.5488994717597961, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.6701, 'grad_norm': 0.6150515675544739, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.6997, 'grad_norm': 0.6171398758888245, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.6808, 'grad_norm': 0.8256254196166992, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.5129, 'grad_norm': 0.9654618501663208, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.5254, 'grad_norm': 0.9222334623336792, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.5083, 'grad_norm': 0.8168906569480896, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.5001, 'grad_norm': 0.9315916299819946, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.4718, 'grad_norm': 0.9013412594795227, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.4961, 'grad_norm': 0.9204163551330566, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.4964, 'grad_norm': 0.9801384210586548, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 114.4333, 'train_samples_per_second': 39.324, 'train_steps_per_second': 4.929, 'train_loss': 0.7272339116597006, 'epoch': 3.0}
[2026-05-05 00:14:14] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / aqua_rat
[2026-05-05 00:14:14] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / mbpp_test_held -> /workspace/round3_out/round4/Y/mbpp_test_held
{'loss': 4.5146, 'grad_norm': 16.630313873291016, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 2.2667, 'grad_norm': 0.6226514577865601, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 1.0255, 'grad_norm': 0.44983041286468506, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 1.0249, 'grad_norm': 0.4704210162162781, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.9964, 'grad_norm': 0.40280309319496155, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 1.0154, 'grad_norm': 0.3987154960632324, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 1.0262, 'grad_norm': 0.5383355021476746, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 1.0071, 'grad_norm': 0.3992350697517395, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.9934, 'grad_norm': 0.4367852807044983, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.9013, 'grad_norm': 0.6981058120727539, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.8675, 'grad_norm': 0.6957953572273254, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.8841, 'grad_norm': 0.6769980192184448, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.8777, 'grad_norm': 0.6507949829101562, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.8745, 'grad_norm': 0.6688486337661743, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.8777, 'grad_norm': 0.7726457715034485, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.8695, 'grad_norm': 0.7908041477203369, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.6927, 'grad_norm': 1.0163742303848267, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.6575, 'grad_norm': 1.1971701383590698, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.6703, 'grad_norm': 1.2276785373687744, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.6535, 'grad_norm': 1.1885679960250854, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.6554, 'grad_norm': 1.0612844228744507, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.6404, 'grad_norm': 1.2093658447265625, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.6244, 'grad_norm': 1.1670273542404175, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 148.7692, 'train_samples_per_second': 30.248, 'train_steps_per_second': 3.791, 'train_loss': 0.911190797251167, 'epoch': 3.0}
[2026-05-05 00:14:14] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / openbookqa
[2026-05-05 00:14:14] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / math_counting_easy -> /workspace/round3_out/round4/X/math_counting_easy
{'loss': 3.3588, 'grad_norm': 4.255439281463623, 'learning_rate': 2.5e-05, 'epoch': 0.020833333333333332}
{'loss': 1.5217, 'grad_norm': 0.6237989068031311, 'learning_rate': 0.0001923879532511287, 'epoch': 0.5208333333333334}
{'loss': 0.7339, 'grad_norm': 0.42053061723709106, 'learning_rate': 0.0001565136414422592, 'epoch': 1.0416666666666667}
{'loss': 0.6307, 'grad_norm': 0.6212886571884155, 'learning_rate': 0.00010230978916530012, 'epoch': 1.5625}
{'loss': 0.5997, 'grad_norm': 0.6173807382583618, 'learning_rate': 4.735678371226441e-05, 'epoch': 2.0833333333333335}
{'loss': 0.4913, 'grad_norm': 0.7990373969078064, 'learning_rate': 9.477991470251791e-06, 'epoch': 2.6041666666666665}
{'train_runtime': 28.0591, 'train_samples_per_second': 40.415, 'train_steps_per_second': 5.132, 'train_loss': 0.7664817919333776, 'epoch': 3.0}
[2026-05-05 00:14:16] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mmlu_elementary_math
[2026-05-05 00:14:16] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / gsm_hard -> /workspace/round3_out/round4/Y/gsm_hard
{'loss': 4.6751, 'grad_norm': 17.089155197143555, 'learning_rate': 2e-05, 'epoch': 0.015873015873015872}
{'loss': 1.7225, 'grad_norm': 0.4432033896446228, 'learning_rate': 0.00019655462532757676, 'epoch': 0.3968253968253968}
{'loss': 0.9892, 'grad_norm': 0.419933557510376, 'learning_rate': 0.0001763531637669949, 'epoch': 0.7936507936507936}
{'loss': 0.9553, 'grad_norm': 0.4371388256549835, 'learning_rate': 0.00014168658260281945, 'epoch': 1.1904761904761905}
{'loss': 0.8658, 'grad_norm': 0.6926215291023254, 'learning_rate': 9.912247141546079e-05, 'epoch': 1.5873015873015874}
{'loss': 0.8429, 'grad_norm': 0.7686144113540649, 'learning_rate': 5.672460816472556e-05, 'epoch': 1.9841269841269842}
{'loss': 0.7046, 'grad_norm': 1.2619212865829468, 'learning_rate': 2.2525275111113807e-05, 'epoch': 2.380952380952381}
{'loss': 0.6523, 'grad_norm': 1.0051343441009521, 'learning_rate': 3.003541602600157e-06, 'epoch': 2.7777777777777777}
{'train_runtime': 49.4027, 'train_samples_per_second': 30.363, 'train_steps_per_second': 3.826, 'train_loss': 0.9559384880873262, 'epoch': 3.0}
[2026-05-05 00:14:30] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / openbookqa_test
{'loss': 2.6631, 'grad_norm': 3.135260581970215, 'learning_rate': 0.0001, 'epoch': 0.07692307692307693}
{'loss': 1.0242, 'grad_norm': 0.5291213989257812, 'learning_rate': 6.271435222196916e-05, 'epoch': 1.9230769230769231}
{'train_runtime': 11.7005, 'train_samples_per_second': 25.64, 'train_steps_per_second': 3.333, 'train_loss': 0.9326245112296863, 'epoch': 3.0}
[2026-05-05 00:14:41] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mbpp_test_held
{'loss': 2.4397, 'grad_norm': 11.824714660644531, 'learning_rate': 5e-05, 'epoch': 0.045454545454545456}
{'loss': 0.6048, 'grad_norm': 0.4208642840385437, 'learning_rate': 0.00014853019625310813, 'epoch': 1.1363636363636362}
{'loss': 0.1343, 'grad_norm': 0.3433544337749481, 'learning_rate': 3.110330809243134e-05, 'epoch': 2.2727272727272725}
{'train_runtime': 18.2628, 'train_samples_per_second': 27.761, 'train_steps_per_second': 3.614, 'train_loss': 0.3346499132387566, 'epoch': 3.0}
[2026-05-05 00:14:48] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / math_counting_easy
[2026-05-05 00:14:48] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mmlu_elementary_math -> /workspace/round3_out/round4/X/mmlu_elementary_math
{'loss': 2.3886, 'grad_norm': 2.93904972076416, 'learning_rate': 2.5e-05, 'epoch': 0.020833333333333332}
{'loss': 1.2149, 'grad_norm': 0.3937697410583496, 'learning_rate': 0.0001923879532511287, 'epoch': 0.5208333333333334}
{'loss': 0.6846, 'grad_norm': 0.4153788387775421, 'learning_rate': 0.0001565136414422592, 'epoch': 1.0416666666666667}
{'loss': 0.5757, 'grad_norm': 0.43175405263900757, 'learning_rate': 0.00010230978916530012, 'epoch': 1.5625}
{'loss': 0.5484, 'grad_norm': 0.46659091114997864, 'learning_rate': 4.735678371226441e-05, 'epoch': 2.0833333333333335}
{'loss': 0.4622, 'grad_norm': 0.5400645732879639, 'learning_rate': 9.477991470251791e-06, 'epoch': 2.6041666666666665}
{'train_runtime': 43.8096, 'train_samples_per_second': 25.885, 'train_steps_per_second': 3.287, 'train_loss': 0.674598667356703, 'epoch': 3.0}
[2026-05-05 00:15:11] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / mbpp_plus
{'loss': 4.354, 'grad_norm': 16.348665237426758, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 1.9324, 'grad_norm': 0.7281725406646729, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.8267, 'grad_norm': 0.42819324135780334, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.8659, 'grad_norm': 0.3702390193939209, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.866, 'grad_norm': 0.46949636936187744, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.8478, 'grad_norm': 0.46469929814338684, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.8812, 'grad_norm': 0.42563918232917786, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.8314, 'grad_norm': 0.456299364566803, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.7688, 'grad_norm': 0.36802512407302856, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.7364, 'grad_norm': 0.6766768097877502, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.7088, 'grad_norm': 0.6927001476287842, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.6744, 'grad_norm': 0.7402969002723694, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.7186, 'grad_norm': 0.7768665552139282, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.6844, 'grad_norm': 0.7452237010002136, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.6877, 'grad_norm': 0.706362247467041, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.6818, 'grad_norm': 0.7270861268043518, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.5056, 'grad_norm': 0.9820103049278259, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.4628, 'grad_norm': 0.9028323888778687, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.5023, 'grad_norm': 0.9801334738731384, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.4918, 'grad_norm': 0.90134596824646, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.4713, 'grad_norm': 0.8841328620910645, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.4774, 'grad_norm': 1.0850573778152466, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.5061, 'grad_norm': 0.6751748919487, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 146.2355, 'train_samples_per_second': 30.772, 'train_steps_per_second': 3.857, 'train_loss': 0.7316120609324029, 'epoch': 3.0}
[2026-05-05 00:15:27] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / medmcqa_easy
[2026-05-05 00:15:27] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mbpp_plus -> /workspace/round3_out/round4/X/mbpp_plus
{'loss': 3.1326, 'grad_norm': 12.43159008026123, 'learning_rate': 2.5e-05, 'epoch': 0.020833333333333332}
{'loss': 1.1004, 'grad_norm': 0.37570875883102417, 'learning_rate': 0.0001923879532511287, 'epoch': 0.5208333333333334}
{'loss': 0.5887, 'grad_norm': 0.3718211352825165, 'learning_rate': 0.0001565136414422592, 'epoch': 1.0416666666666667}
{'loss': 0.5174, 'grad_norm': 0.5157504081726074, 'learning_rate': 0.00010230978916530012, 'epoch': 1.5625}
{'loss': 0.4833, 'grad_norm': 0.6574680805206299, 'learning_rate': 4.735678371226441e-05, 'epoch': 2.0833333333333335}
{'loss': 0.3842, 'grad_norm': 0.6896154880523682, 'learning_rate': 9.477991470251791e-06, 'epoch': 2.6041666666666665}
{'train_runtime': 35.655, 'train_samples_per_second': 31.805, 'train_steps_per_second': 4.039, 'train_loss': 0.5963504877355363, 'epoch': 3.0}
[2026-05-05 00:15:36] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mmlu_elementary_math
[2026-05-05 00:15:36] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / gsm_hard -> /workspace/round3_out/round4/X/gsm_hard
{'loss': 2.9565, 'grad_norm': 12.087135314941406, 'learning_rate': 6.896551724137932e-06, 'epoch': 0.005319148936170213}
{'loss': 1.3598, 'grad_norm': 0.6556797623634338, 'learning_rate': 0.00017241379310344826, 'epoch': 0.13297872340425532}
{'loss': 0.4601, 'grad_norm': 0.5966652035713196, 'learning_rate': 0.00019924063537459386, 'epoch': 0.26595744680851063}
{'loss': 0.4709, 'grad_norm': 0.6679794192314148, 'learning_rate': 0.00019637393494757147, 'epoch': 0.39893617021276595}
{'loss': 0.4815, 'grad_norm': 0.5698530077934265, 'learning_rate': 0.00019143398446884148, 'epoch': 0.5319148936170213}
{'loss': 0.4723, 'grad_norm': 0.6829456686973572, 'learning_rate': 0.00018452705491915232, 'epoch': 0.6648936170212766}
{'loss': 0.4335, 'grad_norm': 0.4312059283256531, 'learning_rate': 0.00017580173203440679, 'epoch': 0.7978723404255319}
{'loss': 0.433, 'grad_norm': 0.596960186958313, 'learning_rate': 0.00016544571984611307, 'epoch': 0.9308510638297872}
{'loss': 0.4009, 'grad_norm': 0.44516804814338684, 'learning_rate': 0.00015368180268715678, 'epoch': 1.0638297872340425}
{'loss': 0.3516, 'grad_norm': 0.6850377917289734, 'learning_rate': 0.00014076305253048747, 'epoch': 1.196808510638298}
{'loss': 0.3451, 'grad_norm': 0.5686097145080566, 'learning_rate': 0.00012696738476313262, 'epoch': 1.3297872340425532}
{'loss': 0.3553, 'grad_norm': 0.6032043099403381, 'learning_rate': 0.0001125915795147773, 'epoch': 1.4627659574468086}
{'loss': 0.3309, 'grad_norm': 0.340126633644104, 'learning_rate': 9.79448971574372e-05, 'epoch': 1.5957446808510638}
{'loss': 0.3298, 'grad_norm': 0.5898280143737793, 'learning_rate': 8.334242532316977e-05, 'epoch': 1.728723404255319}
{'loss': 0.3487, 'grad_norm': 0.4773308038711548, 'learning_rate': 6.909830056250527e-05, 'epoch': 1.8617021276595744}
{'loss': 0.3521, 'grad_norm': 0.7029126882553101, 'learning_rate': 5.55189504630756e-05, 'epoch': 1.9946808510638299}
{'loss': 0.2556, 'grad_norm': 0.5105568170547485, 'learning_rate': 4.289650160776967e-05, 'epoch': 2.127659574468085}
{'loss': 0.256, 'grad_norm': 0.5709134936332703, 'learning_rate': 3.1502495184110666e-05, 'epoch': 2.2606382978723403}
{'loss': 0.2578, 'grad_norm': 0.5221186280250549, 'learning_rate': 2.1582045438184463e-05, 'epoch': 2.393617021276596}
{'loss': 0.2546, 'grad_norm': 0.7243080139160156, 'learning_rate': 1.334856663973003e-05, 'epoch': 2.526595744680851}
{'loss': 0.2337, 'grad_norm': 0.3966846168041229, 'learning_rate': 6.979181994870587e-06, 'epoch': 2.6595744680851063}
{'loss': 0.2492, 'grad_norm': 0.5187064409255981, 'learning_rate': 2.6109132725262164e-06, 'epoch': 2.7925531914893615}
{'loss': 0.2361, 'grad_norm': 0.5425437688827515, 'learning_rate': 3.3773311539742057e-07, 'epoch': 2.925531914893617}
{'train_runtime': 147.2138, 'train_samples_per_second': 30.568, 'train_steps_per_second': 3.831, 'train_loss': 0.3937793999698991, 'epoch': 3.0}
[2026-05-05 00:15:39] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / aqua_rat
[2026-05-05 00:15:39] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / mbpp_test_held -> /workspace/round3_out/round4/X/mbpp_test_held
{'loss': 2.5622, 'grad_norm': 7.437411308288574, 'learning_rate': 0.0001, 'epoch': 0.07692307692307693}
{'loss': 0.7589, 'grad_norm': 0.45616862177848816, 'learning_rate': 6.271435222196916e-05, 'epoch': 1.9230769230769231}
{'train_runtime': 12.6983, 'train_samples_per_second': 23.625, 'train_steps_per_second': 3.071, 'train_loss': 0.6965176264444987, 'epoch': 3.0}
[2026-05-05 00:16:04] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mbpp_test_held
{'loss': 3.884, 'grad_norm': 4.506526947021484, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.006060606060606061}
{'loss': 2.3995, 'grad_norm': 0.6118884682655334, 'learning_rate': 0.0002, 'epoch': 0.15151515151515152}
{'loss': 1.2815, 'grad_norm': 0.6239715814590454, 'learning_rate': 0.00019860702539900287, 'epoch': 0.30303030303030304}
{'loss': 1.194, 'grad_norm': 0.5346339344978333, 'learning_rate': 0.0001944669091607919, 'epoch': 0.45454545454545453}
{'loss': 1.1739, 'grad_norm': 0.534546434879303, 'learning_rate': 0.00018769499282066717, 'epoch': 0.6060606060606061}
{'loss': 1.2139, 'grad_norm': 0.5251949429512024, 'learning_rate': 0.0001784799385278661, 'epoch': 0.7575757575757576}
{'loss': 1.1891, 'grad_norm': 0.4779992401599884, 'learning_rate': 0.00016707847301392236, 'epoch': 0.9090909090909091}
{'loss': 1.0965, 'grad_norm': 0.5101320743560791, 'learning_rate': 0.00015380823531633729, 'epoch': 1.0606060606060606}
{'loss': 1.0722, 'grad_norm': 0.6076408624649048, 'learning_rate': 0.00013903892751634947, 'epoch': 1.2121212121212122}
{'loss': 1.0252, 'grad_norm': 0.6454022526741028, 'learning_rate': 0.00012318201502675285, 'epoch': 1.3636363636363638}
{'loss': 1.0416, 'grad_norm': 0.671332061290741, 'learning_rate': 0.00010667926337451217, 'epoch': 1.5151515151515151}
{'loss': 1.0702, 'grad_norm': 0.71003657579422, 'learning_rate': 8.999043083759017e-05, 'epoch': 1.6666666666666665}
{'loss': 1.0308, 'grad_norm': 0.7167981266975403, 'learning_rate': 7.358045981287141e-05, 'epoch': 1.8181818181818183}
{'loss': 1.0398, 'grad_norm': 0.7149976491928101, 'learning_rate': 5.790652375716652e-05, 'epoch': 1.9696969696969697}
{'loss': 0.8998, 'grad_norm': 0.8639608025550842, 'learning_rate': 4.340529056694047e-05, 'epoch': 2.121212121212121}
{'loss': 0.8749, 'grad_norm': 0.9446839690208435, 'learning_rate': 3.0480757232535772e-05, 'epoch': 2.2727272727272725}
{'loss': 0.8384, 'grad_norm': 1.06491219997406, 'learning_rate': 1.9492994687243714e-05, 'epoch': 2.4242424242424243}
{'loss': 0.8512, 'grad_norm': 0.9113675355911255, 'learning_rate': 1.0748116414011888e-05, 'epoch': 2.5757575757575757}
{'loss': 0.831, 'grad_norm': 0.9470285773277283, 'learning_rate': 4.489750279308757e-06, 'epoch': 2.7272727272727275}
{'loss': 0.843, 'grad_norm': 1.0753146409988403, 'learning_rate': 8.922511845219971e-07, 'epoch': 2.878787878787879}
{'train_runtime': 96.9305, 'train_samples_per_second': 40.823, 'train_steps_per_second': 5.107, 'train_loss': 1.0953025687824596, 'epoch': 3.0}
[2026-05-05 00:16:08] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / gsm_hard
[2026-05-05 00:16:08] [TRAIN_START] meta-llama/Llama-3.2-3B-Instruct / arc_challenge -> /workspace/round3_out/round4/Y/arc_challenge
{'loss': 2.1308, 'grad_norm': 6.77037239074707, 'learning_rate': 2.5e-05, 'epoch': 0.020833333333333332}
{'loss': 0.9169, 'grad_norm': 0.3555704355239868, 'learning_rate': 0.0001923879532511287, 'epoch': 0.5208333333333334}
{'loss': 0.5314, 'grad_norm': 0.32545629143714905, 'learning_rate': 0.0001565136414422592, 'epoch': 1.0416666666666667}
{'loss': 0.4567, 'grad_norm': 0.3761088252067566, 'learning_rate': 0.00010230978916530012, 'epoch': 1.5625}
{'loss': 0.4193, 'grad_norm': 0.3940057158470154, 'learning_rate': 4.735678371226441e-05, 'epoch': 2.0833333333333335}
{'loss': 0.3471, 'grad_norm': 0.47685983777046204, 'learning_rate': 9.477991470251791e-06, 'epoch': 2.6041666666666665}
{'train_runtime': 48.3808, 'train_samples_per_second': 23.439, 'train_steps_per_second': 2.976, 'train_loss': 0.5177151577340232, 'epoch': 3.0}
[2026-05-05 00:16:27] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / mbpp_plus
{'loss': 3.768, 'grad_norm': 4.5531511306762695, 'learning_rate': 9.523809523809523e-06, 'epoch': 0.007142857142857143}
{'loss': 2.1189, 'grad_norm': 0.656963050365448, 'learning_rate': 0.00019995040840893388, 'epoch': 0.17857142857142858}
{'loss': 0.9756, 'grad_norm': 0.5717995762825012, 'learning_rate': 0.00019740443316860467, 'epoch': 0.35714285714285715}
{'loss': 0.9438, 'grad_norm': 0.5498923659324646, 'learning_rate': 0.00019109653447608606, 'epoch': 0.5357142857142857}
{'loss': 0.9094, 'grad_norm': 0.5308817625045776, 'learning_rate': 0.00018127033401525301, 'epoch': 0.7142857142857143}
{'loss': 0.9256, 'grad_norm': 0.5131648778915405, 'learning_rate': 0.00016830533621682822, 'epoch': 0.8928571428571429}
{'loss': 0.8848, 'grad_norm': 0.5161603689193726, 'learning_rate': 0.0001527022711573479, 'epoch': 1.0714285714285714}
{'loss': 0.7818, 'grad_norm': 0.7658253908157349, 'learning_rate': 0.00013506375551927547, 'epoch': 1.25}
{'loss': 0.7921, 'grad_norm': 0.7582848072052002, 'learning_rate': 0.00011607101851859346, 'epoch': 1.4285714285714286}
{'loss': 0.7878, 'grad_norm': 0.675399124622345, 'learning_rate': 9.645759168379463e-05, 'epoch': 1.6071428571428572}
{'loss': 0.7958, 'grad_norm': 0.7119758725166321, 'learning_rate': 7.698097863137152e-05, 'epoch': 1.7857142857142856}
{'loss': 0.7427, 'grad_norm': 0.8570743799209595, 'learning_rate': 5.839339899884628e-05, 'epoch': 1.9642857142857144}
{'loss': 0.6073, 'grad_norm': 0.9859674572944641, 'learning_rate': 4.141273645397754e-05, 'epoch': 2.142857142857143}
{'loss': 0.5545, 'grad_norm': 1.1067148447036743, 'learning_rate': 2.669481281701739e-05, 'epoch': 2.3214285714285716}
{'loss': 0.5834, 'grad_norm': 1.1293858289718628, 'learning_rate': 1.4808059116167305e-05, 'epoch': 2.5}
{'loss': 0.5513, 'grad_norm': 1.1833295822143555, 'learning_rate': 6.211561822781476e-06, 'epoch': 2.678571428571429}
{'loss': 0.5619, 'grad_norm': 1.1636583805084229, 'learning_rate': 1.237332157732063e-06, 'epoch': 2.857142857142857}
{'train_runtime': 79.0103, 'train_samples_per_second': 42.488, 'train_steps_per_second': 5.316, 'train_loss': 0.8358220344498044, 'epoch': 3.0}
[2026-05-05 00:17:39] [TRAIN_DONE] meta-llama/Llama-3.2-3B-Instruct / arc_challenge
{'loss': 3.6585, 'grad_norm': 12.32223129272461, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.006060606060606061}
{'loss': 2.0153, 'grad_norm': 0.5624030232429504, 'learning_rate': 0.0002, 'epoch': 0.15151515151515152}
{'loss': 1.0853, 'grad_norm': 0.5862419605255127, 'learning_rate': 0.00019860702539900287, 'epoch': 0.30303030303030304}
{'loss': 1.0602, 'grad_norm': 0.41786763072013855, 'learning_rate': 0.0001944669091607919, 'epoch': 0.45454545454545453}
{'loss': 1.0371, 'grad_norm': 0.3576550781726837, 'learning_rate': 0.00018769499282066717, 'epoch': 0.6060606060606061}
{'loss': 1.0725, 'grad_norm': 1.0810116529464722, 'learning_rate': 0.0001784799385278661, 'epoch': 0.7575757575757576}
{'loss': 1.0494, 'grad_norm': 0.3635009527206421, 'learning_rate': 0.00016707847301392236, 'epoch': 0.9090909090909091}
{'loss': 0.9753, 'grad_norm': 0.6509284377098083, 'learning_rate': 0.00015380823531633729, 'epoch': 1.0606060606060606}
{'loss': 0.9519, 'grad_norm': 0.4772036373615265, 'learning_rate': 0.00013903892751634947, 'epoch': 1.2121212121212122}
{'loss': 0.9085, 'grad_norm': 0.4841245710849762, 'learning_rate': 0.00012318201502675285, 'epoch': 1.3636363636363638}
{'loss': 0.9307, 'grad_norm': 0.5715106129646301, 'learning_rate': 0.00010667926337451217, 'epoch': 1.5151515151515151}
{'loss': 0.95, 'grad_norm': 0.5565882921218872, 'learning_rate': 8.999043083759017e-05, 'epoch': 1.6666666666666665}
{'loss': 0.9132, 'grad_norm': 0.6265892386436462, 'learning_rate': 7.358045981287141e-05, 'epoch': 1.8181818181818183}
{'loss': 0.927, 'grad_norm': 0.5717490315437317, 'learning_rate': 5.790652375716652e-05, 'epoch': 1.9696969696969697}
{'loss': 0.8006, 'grad_norm': 0.7313489317893982, 'learning_rate': 4.340529056694047e-05, 'epoch': 2.121212121212121}
{'loss': 0.7657, 'grad_norm': 0.8018691539764404, 'learning_rate': 3.0480757232535772e-05, 'epoch': 2.2727272727272725}
{'loss': 0.7375, 'grad_norm': 1.0414997339248657, 'learning_rate': 1.9492994687243714e-05, 'epoch': 2.4242424242424243}
{'loss': 0.7471, 'grad_norm': 0.7975998520851135, 'learning_rate': 1.0748116414011888e-05, 'epoch': 2.5757575757575757}
{'loss': 0.7401, 'grad_norm': 0.9030542373657227, 'learning_rate': 4.489750279308757e-06, 'epoch': 2.7272727272727275}
{'loss': 0.7414, 'grad_norm': 0.9085662364959717, 'learning_rate': 8.922511845219971e-07, 'epoch': 2.878787878787879}
{'train_runtime': 120.567, 'train_samples_per_second': 32.82, 'train_steps_per_second': 4.106, 'train_loss': 0.96247435675727, 'epoch': 3.0}
[2026-05-05 00:17:49] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / gsm_hard
[2026-05-05 00:17:49] [TRAIN_START] Qwen/Qwen2.5-3B-Instruct / arc_challenge -> /workspace/round3_out/round4/X/arc_challenge
{'loss': 3.7325, 'grad_norm': 13.752643585205078, 'learning_rate': 9.523809523809523e-06, 'epoch': 0.007142857142857143}
{'loss': 1.7027, 'grad_norm': 0.48250889778137207, 'learning_rate': 0.00019995040840893388, 'epoch': 0.17857142857142858}
{'loss': 0.8574, 'grad_norm': 0.41164278984069824, 'learning_rate': 0.00019740443316860467, 'epoch': 0.35714285714285715}
{'loss': 0.8837, 'grad_norm': 0.4247625768184662, 'learning_rate': 0.00019109653447608606, 'epoch': 0.5357142857142857}
{'loss': 0.8492, 'grad_norm': 0.4014907479286194, 'learning_rate': 0.00018127033401525301, 'epoch': 0.7142857142857143}
{'loss': 0.872, 'grad_norm': 0.3833802044391632, 'learning_rate': 0.00016830533621682822, 'epoch': 0.8928571428571429}
{'loss': 0.8381, 'grad_norm': 0.3464404046535492, 'learning_rate': 0.0001527022711573479, 'epoch': 1.0714285714285714}
{'loss': 0.735, 'grad_norm': 0.6449007391929626, 'learning_rate': 0.00013506375551927547, 'epoch': 1.25}
{'loss': 0.7462, 'grad_norm': 0.6440525650978088, 'learning_rate': 0.00011607101851859346, 'epoch': 1.4285714285714286}
{'loss': 0.7331, 'grad_norm': 0.6458334922790527, 'learning_rate': 9.645759168379463e-05, 'epoch': 1.6071428571428572}
{'loss': 0.7438, 'grad_norm': 0.6578438878059387, 'learning_rate': 7.698097863137152e-05, 'epoch': 1.7857142857142856}
{'loss': 0.6972, 'grad_norm': 0.729121208190918, 'learning_rate': 5.839339899884628e-05, 'epoch': 1.9642857142857144}
{'loss': 0.5626, 'grad_norm': 0.7916315197944641, 'learning_rate': 4.141273645397754e-05, 'epoch': 2.142857142857143}
{'loss': 0.5037, 'grad_norm': 1.2884725332260132, 'learning_rate': 2.669481281701739e-05, 'epoch': 2.3214285714285716}
{'loss': 0.5354, 'grad_norm': 1.0814889669418335, 'learning_rate': 1.4808059116167305e-05, 'epoch': 2.5}
{'loss': 0.4954, 'grad_norm': 1.1070040464401245, 'learning_rate': 6.211561822781476e-06, 'epoch': 2.678571428571429}
{'loss': 0.5168, 'grad_norm': 1.0021690130233765, 'learning_rate': 1.237332157732063e-06, 'epoch': 2.857142857142857}
{'train_runtime': 99.9713, 'train_samples_per_second': 33.58, 'train_steps_per_second': 4.201, 'train_loss': 0.7603948621522812, 'epoch': 3.0}
[2026-05-05 00:19:41] [TRAIN_DONE] Qwen/Qwen2.5-3B-Instruct / arc_challenge
[2026-05-05 00:19:43] [ANCHOR_MISSING] asdiv
[2026-05-05 00:19:43] [ANCHOR_MISSING] mawps
[2026-05-05 00:19:43] [ANCHOR_MISSING] codealpaca_mini
[2026-05-05 00:19:43] [ANCHOR_MISSING] apps_introductory
[2026-05-05 00:19:43] [ANCHOR_MISSING] conala_curated
[2026-05-05 00:19:43] [ANCHOR_MISSING] codecontests_easy
[2026-05-05 00:19:43] [ANCHOR_MISSING] livecodebench_easy
[2026-05-05 00:19:43] [ANCHOR_MISSING] pubmedqa_pqal
[2026-05-05 00:19:43] Available anchors: 16 counts={'math': 6, 'code': 3, 'science': 7}
[2026-05-05 00:19:43] [EXP1] Building/evaluating main mapping table
[2026-05-05 00:19:43] [EXP1_TASK] gsm_hard
/usr/local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:595: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
[2026-05-05 00:25:19] [EXP1_DONE] gsm_hard: {'Domain': 'math', 'Task': 'gsm_hard', 'base_Y': 0.06333333333333334, 'mean': 0.056666666666666664, 'global_ridge': 0.06, 'pertensor_ridge': 0.06666666666666667, 'topk8_global_ridge': 0.06666666666666667, 'topk8_pertensor_ridge': 0.06333333333333334, 'pertensor_mlp': 0.07333333333333333, 'oracle': 0.15, 'oracle_minus_base_pp': 8.666666666666666, 'usable': True, 'gap_recovered': 0.11538461538461534}
[2026-05-05 00:25:19] [EXP1_TASK] gsm8k_test_500
[2026-05-05 00:29:32] [EXP1_DONE] gsm8k_test_500: {'Domain': 'math', 'Task': 'gsm8k_test_500', 'base_Y': 0.08, 'mean': 0.09333333333333334, 'global_ridge': 0.1, 'pertensor_ridge': 0.1, 'topk8_global_ridge': 0.09333333333333334, 'topk8_pertensor_ridge': 0.09666666666666666, 'pertensor_mlp': 0.1, 'oracle': 0.29333333333333333, 'oracle_minus_base_pp': 21.333333333333332, 'usable': True, 'gap_recovered': 0.09375000000000003}
[2026-05-05 00:29:32] [EXP1_TASK] mbpp_test_held
[2026-05-05 00:37:14] [EXP1_DONE] mbpp_test_held: {'Domain': 'code', 'Task': 'mbpp_test_held', 'base_Y': 0.23, 'mean': 0.24, 'global_ridge': 0.25, 'pertensor_ridge': 0.25, 'topk8_global_ridge': 0.25, 'topk8_pertensor_ridge': 0.25, 'pertensor_mlp': 0.24, 'oracle': 0.32, 'oracle_minus_base_pp': 9.0, 'usable': True, 'gap_recovered': 0.22222222222222213}
[2026-05-05 00:37:14] [EXP1_TASK] mbpp_plus
[2026-05-05 00:54:10] [EXP1_DONE] mbpp_plus: {'Domain': 'code', 'Task': 'mbpp_plus', 'base_Y': 0.21666666666666667, 'mean': 0.21333333333333335, 'global_ridge': 0.28, 'pertensor_ridge': 0.27, 'topk8_global_ridge': 0.27, 'topk8_pertensor_ridge': 0.26666666666666666, 'pertensor_mlp': 0.21, 'oracle': 0.45, 'oracle_minus_base_pp': 23.333333333333332, 'usable': True, 'gap_recovered': 0.2714285714285715}
[2026-05-05 00:54:10] [EXP1_TASK] arc_challenge
[2026-05-05 00:59:21] [EXP1_DONE] arc_challenge: {'Domain': 'science', 'Task': 'arc_challenge', 'base_Y': 0.7157190635451505, 'mean': 0.7324414715719063, 'global_ridge': 0.7357859531772575, 'pertensor_ridge': 0.7290969899665551, 'topk8_global_ridge': 0.7357859531772575, 'topk8_pertensor_ridge': 0.7290969899665551, 'pertensor_mlp': 0.7391304347826086, 'oracle': 0.7224080267558528, 'oracle_minus_base_pp': 0.6688963210702337, 'usable': False, 'gap_recovered': 5.0}
[2026-05-05 00:59:21] [EXP1_TASK] openbookqa_test
[2026-05-05 01:04:13] [EXP1_DONE] openbookqa_test: {'Domain': 'science', 'Task': 'openbookqa_test', 'base_Y': 0.71, 'mean': 0.76, 'global_ridge': 0.7466666666666667, 'pertensor_ridge': 0.7433333333333333, 'topk8_global_ridge': 0.7133333333333334, 'topk8_pertensor_ridge': 0.7166666666666667, 'pertensor_mlp': 0.7533333333333333, 'oracle': 0.9833333333333333, 'oracle_minus_base_pp': 27.333333333333332, 'usable': True, 'gap_recovered': 0.18292682926829285}
[2026-05-05 01:04:13] [EXP2] Anchor-count + Top-K scaling
[2026-05-05 01:07:50] [EXP2] N5_global_ridge: {'math': -0.010817307692307696, 'code': 0.026984126984126878, 'science': 1.5304878048780488}
[2026-05-05 01:11:35] [EXP2] N12_global_ridge: {'math': -0.03425480769230772, 'code': 0.2539682539682539, 'science': 2.073170731707317}
[2026-05-05 01:15:22] [EXP2] N12_topk8_global_ridge: {'math': -0.010817307692307696, 'code': 0.2738095238095238, 'science': 1.3353658536585367}
[2026-05-05 01:19:19] [EXP2] N12_topk12_global_ridge: {'math': -0.018629807692307716, 'code': 0.23253968253968244, 'science': 1.829268292682927}
[2026-05-05 01:19:19] [EXP2] N16_global_ridge: {'math': 0.027644230769230737, 'code': 0.24682539682539684, 'science': 1.5670731707317074}
[2026-05-05 01:19:19] [EXP2] N16_topk8_global_ridge: {'math': 0.050480769230769204, 'code': 0.22539682539682537, 'science': 1.50609756097561}
[2026-05-05 01:23:44] [EXP2] N16_topk12_global_ridge: {'math': 0.03906249999999999, 'code': 0.2396825396825396, 'science': 2.3170731707317076}
[2026-05-05 01:23:44] [EXP3] Cross-domain transfer heatmap
[2026-05-05 01:24:09] [EXP3] math-only -> gsm_hard: acc=0.0600 gap=-0.038461538461538554
[2026-05-05 01:24:27] [EXP3] math-only -> gsm8k_test_500: acc=0.0967 gap=0.07812499999999999
[2026-05-05 01:25:08] [EXP3] math-only -> mbpp_test_held: acc=0.2300 gap=0.0
[2026-05-05 01:26:47] [EXP3] math-only -> mbpp_plus: acc=0.2067 gap=-0.04285714285714289
[2026-05-05 01:27:04] [EXP3] math-only -> arc_challenge: acc=0.7291 gap=2.0
[2026-05-05 01:27:20] [EXP3] math-only -> openbookqa_test: acc=0.7400 gap=0.10975609756097571
[2026-05-05 01:27:49] [EXP3] code-only -> gsm_hard: acc=0.0500 gap=-0.15384615384615388
[2026-05-05 01:28:14] [EXP3] code-only -> gsm8k_test_500: acc=0.0733 gap=-0.03125000000000001
[2026-05-05 01:28:53] [EXP3] code-only -> mbpp_test_held: acc=0.2700 gap=0.44444444444444453
[2026-05-05 01:30:28] [EXP3] code-only -> mbpp_plus: acc=0.2733 gap=0.24285714285714274
[2026-05-05 01:30:44] [EXP3] code-only -> arc_challenge: acc=0.7057 gap=-1.5
[2026-05-05 01:30:59] [EXP3] code-only -> openbookqa_test: acc=0.7133 gap=0.012195121951219795
[2026-05-05 01:31:30] [EXP3] science-only -> gsm_hard: acc=0.0600 gap=-0.038461538461538554
[2026-05-05 01:31:59] [EXP3] science-only -> gsm8k_test_500: acc=0.1167 gap=0.171875
[2026-05-05 01:32:41] [EXP3] science-only -> mbpp_test_held: acc=0.2400 gap=0.11111111111111091
[2026-05-05 01:34:26] [EXP3] science-only -> mbpp_plus: acc=0.2133 gap=-0.01428571428571426
[2026-05-05 01:34:57] [EXP3] science-only -> arc_challenge: acc=0.7391 gap=3.5
[2026-05-05 01:35:27] [EXP3] science-only -> openbookqa_test: acc=0.7400 gap=0.10975609756097571
[2026-05-05 01:35:54] [EXP3] math+code -> gsm_hard: acc=0.0500 gap=-0.15384615384615388
[2026-05-05 01:36:13] [EXP3] math+code -> gsm8k_test_500: acc=0.1000 gap=0.09375000000000003
[2026-05-05 01:36:56] [EXP3] math+code -> mbpp_test_held: acc=0.2500 gap=0.22222222222222213
[2026-05-05 01:38:37] [EXP3] math+code -> mbpp_plus: acc=0.2667 gap=0.21428571428571425
[2026-05-05 01:38:54] [EXP3] math+code -> arc_challenge: acc=0.7191 gap=0.5
[2026-05-05 01:39:12] [EXP3] math+code -> openbookqa_test: acc=0.7367 gap=0.09756097560975632
[2026-05-05 01:39:40] [EXP3] all -> gsm_hard: acc=0.0667 gap=0.038461538461538394
[2026-05-05 01:40:02] [EXP3] all -> gsm8k_test_500: acc=0.0933 gap=0.06250000000000001
[2026-05-05 01:40:45] [EXP3] all -> mbpp_test_held: acc=0.2500 gap=0.22222222222222213
[2026-05-05 01:42:26] [EXP3] all -> mbpp_plus: acc=0.2700 gap=0.22857142857142862
[2026-05-05 01:42:59] [EXP3] all -> arc_challenge: acc=0.7358 gap=3.0
[2026-05-05 01:43:28] [EXP3] all -> openbookqa_test: acc=0.7133 gap=0.012195121951219795
[2026-05-05 01:43:29] [PUSH] Creating/uploading to CK0607/cross-model-lora-prediction-3b
It seems you are trying to upload a large folder at once. This might take some time and then fail if the folder is too large. For such cases, it is recommended to upload in smaller batches or to use `HfApi().upload_large_folder(...)`/`hf upload-large-folder` instead. For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#upload-a-large-folder.