---
{}
---

# Qwen-3B-R1-AHA-V1

This model was trained with GRPO (Group Relative Policy Optimization) on the Countdown Game task to develop reasoning capabilities.

## Model Details

- Base Model: Qwen/Qwen2.5-3B-Instruct
- Training: GRPO with self-verification rewards
- Task: Countdown Game mathematical reasoning

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned weights and the matching tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("balnazzar/qwen-r1-aha")
tokenizer = AutoTokenizer.from_pretrained("balnazzar/qwen-r1-aha")
```

## Training

- Dataset: Countdown-Tasks-3to4
- Reward Functions: Format checking and equation verification
- Hardware: NVIDIA A6000 (takes 45 GB of GPU memory)
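
The two reward functions named above (format checking and equation verification) could look roughly like the sketch below. This is not the code used to train this model. The `<think>`/`<answer>` tag layout, the function names `format_reward` and `equation_reward`, and the rule that every given number must be used exactly once are all assumptions made for illustration.

```python
import ast
import operator
import re

# Assumed output layout: <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

# Only the four basic arithmetic operators are accepted in an answer.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a restricted arithmetic AST; return (value, integers used)."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        return node.value, [node.value]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left, left_nums = _eval(node.left)
        right, right_nums = _eval(node.right)
        return _OPS[type(node.op)](left, right), left_nums + right_nums
    raise ValueError("unsupported expression")


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0


def equation_reward(completion: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if the answer uses each number exactly once and hits the target."""
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0
    try:
        tree = ast.parse(match.group(1).strip(), mode="eval")
        value, used = _eval(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if sorted(used) != sorted(numbers):
        return 0.0
    return 1.0 if abs(value - target) < 1e-5 else 0.0
```

Parsing the answer with `ast` instead of calling `eval` keeps the check restricted to plain arithmetic, so a malformed or adversarial completion scores zero rather than executing arbitrary code.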