---
datasets:
- Skywork/Skywork-Reward-Preference-80K-v0.1
base_model:
- lblaoke/qwama-0.5b-skywork-pref-sft-chosen-trl-v3
---

learning_rate: 5.0e-7
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
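
The batch settings above determine how many examples contribute to each optimizer step. A minimal sketch of that arithmetic follows. The helper `effective_batch_size` and the `num_devices` parameter are hypothetical; the card does not state how many devices were used, so a single device is assumed here.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Values taken from this card. num_devices=1 is an assumption, not a stated fact.
examples_per_step = effective_batch_size(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_devices=1,
)
print(examples_per_step)  # 16
```

With more devices the result scales linearly, so the figure above is best read as a per-device value.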