2025-12-12 16:16:21,676 __main__ [INFO] OUT_DIR set to outputs/2025-12-12/16-16-21
2025-12-12 16:16:21,678 __main__ [INFO] Args: app/src/S3_6_sft.py --sft.per_device_train_batch_size 32 --sft.per_device_eval_batch_size 32 --sft.gradient_accumulation_steps 32 --sft.push_to_hub --mode mixed --post-str-ratio 0.5 --model.model_name_or_path tokyotech-llm/Llama-3.1-Swallow-8B-v0.5
2025-12-12 16:16:22,746 accelerate.utils.modeling get_balanced_memory [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2025-12-12 16:16:35,558 __main__ load_model [INFO] Model loaded from tokyotech-llm/Llama-3.1-Swallow-8B-v0.5
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)
2025-12-12 16:16:38,588 __main__ load_jwtd [INFO] Dataset shuffled, seed=42
2025-12-12 16:16:38,588 __main__ load_jwtd [INFO] Dataset loaded, N=57062
2025-12-12 16:16:39,725 __main__ load_jwtd [INFO] Dataset loaded, N=12228
2025-12-12 16:16:39,725 __main__ main [INFO] Filtered dataset: train 57062 rows, eval 12228 rows
2025-12-12 16:16:39,735 __main__ add_train_str_with_ratio [INFO] pre_str:posts_str = 28602:28460 = 0.501:0.499
2025-12-12 16:16:39,880 __main__ add_train_str_with_ratio [INFO] pre_str:posts_str = 6107:6121 = 0.499:0.501
2025-12-12 16:16:54,946 __main__ main [INFO] wandb initialized
2025-12-12 16:16:59,141 __main__ main [INFO] Starting SFT training with SFTTrainer
2025-12-12 16:21:50,741 root evaluate_probability_ratio [INFO] Results epoch 0: outputs/2025-12-12/16-16-21/probability_ratio_epoch_0.json
2025-12-12 16:38:43,185 root evaluate_probability_ratio [INFO] Results epoch 1: outputs/2025-12-12/16-16-21/probability_ratio_epoch_1.json
2025-12-12 16:55:34,203 root evaluate_probability_ratio [INFO] Results epoch 2: outputs/2025-12-12/16-16-21/probability_ratio_epoch_2.json
2025-12-12 17:12:18,740 root evaluate_probability_ratio [INFO] Results epoch 3: outputs/2025-12-12/16-16-21/probability_ratio_epoch_3.json
2025-12-12 17:29:17,372 root evaluate_probability_ratio [INFO] Results epoch 4: outputs/2025-12-12/16-16-21/probability_ratio_epoch_4.json
2025-12-12 17:46:10,154 root evaluate_probability_ratio [INFO] Results epoch 5: outputs/2025-12-12/16-16-21/probability_ratio_epoch_5.json