---
base_model: Qwen/Qwen2.5-7B-Base
library_name: transformers
pipeline_tag: text-generation
datasets:
- OpenDataArena/ODA-Math-460k
tags:
- qwen2.5
- sft
- opendataarena
- oda-math
- math
- reasoning
license: cc-by-nc-4.0
language:
- en
metrics:
- accuracy
---
# Qwen2.5-7B-ODA-Math-460k
Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of **Qwen2.5-7B-Base**, trained with **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**.
ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**.
It targets a "**learnable but challenging**" difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

---
## Model Summary
- **Base Model**: `Qwen/Qwen2.5-7B-Base`
- **Training Data**: `OpenDataArena/ODA-Math-460k`
- **Domain Coverage**: Mathematics (strictly filtered)
- **Scale (selected training set)**: ~**460K** problems (after selection and verification pipeline)
- **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.
---
## Training Data Curation Pipeline
ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.
### 1. Data Collection
We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math.
### 2. Deduplication & Decontamination
We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
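
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the card does not say how overlaps with benchmarks are detected, so the word n-gram overlap test, the `normalize` helper, and the window size `n` are all assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions):
    """Keep the first occurrence of each question, compared as exact strings."""
    seen, kept = set(), []
    for q in questions:
        key = q.strip()
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept


def ngrams(text, n):
    """Set of word n-grams; texts shorter than n words yield an empty set."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(questions, benchmark_questions, n=8):
    """Drop questions sharing at least one word n-gram with any benchmark item.

    Assumption: n-gram overlap stands in for the unspecified overlap check.
    """
    bench = set()
    for b in benchmark_questions:
        bench |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & bench)]


pool = [
    "What is 2 + 2?",
    "What is 2 + 2?",
    "Find the sum of the first ten positive odd integers.",
    "Compute 7 * 6.",
]
benchmark = ["Find the sum of the first ten positive odd integers and report it."]

deduped = exact_dedup(pool)
clean = decontaminate(deduped, benchmark, n=5)
print(clean)
```

Note that in this sketch a question shorter than `n` words produces no n-grams and is therefore never flagged, so a real pipeline would need a separate rule for very short items.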
### 3. Question Filtering (Quality & Suitability)
A multi-stage filtering pipeline refines domain specificity and usability:
- An LLM-based **domain classifier** removes out-of-domain items such as coding or general instruction tasks.
- An LLM-based **validity validator** removes ill-formed questions with missing premises or undefined notation.
- **Problem-type filtering** (via the *Big Math* toolkit) excludes proof questions and guessing-prone formats such as multiple-choice and true/false.

The remaining pool consists predominantly of **free-form** problems with objectively verifiable answers.
### Filtration Statistics
| Pipeline Stage | Count | Percentage |
|---|---:|---:|
| Raw Collection | 11.4M | 100% |
| Dedup & Decontamination | 4.3M | 37.7% |
| Question Filtering | 3.3M | 28.9% |
| Stage-1 Filtering | 815.3K | 7.2% |
| Stage-2 Filtering | 459.6K | 4.0% |
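
The percentages in the table are each stage's count divided by the raw collection size. The short check below reproduces them from the rounded counts shown above and also derives the stage-over-stage retention, which the table does not list. The variable names are illustrative, and because the inputs are rounded (e.g., 11.4M), the step-wise figures are approximate.

```python
# Counts copied from the table above (rounded as reported there).
counts = {
    "Raw Collection": 11_400_000,
    "Dedup & Decontamination": 4_300_000,
    "Question Filtering": 3_300_000,
    "Stage-1 Filtering": 815_300,
    "Stage-2 Filtering": 459_600,
}

raw = counts["Raw Collection"]

# Share of the raw pool that survives up to each stage (the "Percentage" column).
cumulative = {stage: round(100 * n / raw, 1) for stage, n in counts.items()}

# Share of the previous stage's output that each stage keeps.
stages = list(counts)
stepwise = {
    stages[i]: round(100 * counts[stages[i]] / counts[stages[i - 1]], 1)
    for i in range(1, len(stages))
}

for stage in stages:
    print(f"{stage:<26} {cumulative[stage]:>6}%  (step: {stepwise.get(stage, 100.0)}%)")
```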
---
## Data Selection
Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**.
### Stage-1: Lower-Bound Filtering
Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode. For each problem **x** we sample **k=4** responses and compute **Pass@4** by matching each predicted final answer to the ground-truth answer **y_gt**. A problem is kept **only if** **Pass@4(x) = 0**, i.e., none of the four attempts is correct.
### Stage-2: Upper-Bound Filtering
Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it).
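
The two-stage rule can be sketched as below, assuming Pass@k is treated as an indicator (1 if any of the k attempts is correct, else 0). The `Sampler` interface, the toy samplers, and the exact string comparison are assumptions for illustration only; the card does not describe how answers are matched, and a real pipeline would call the two models and likely use a math-aware equivalence check.

```python
from typing import Callable, List, Tuple

# Hypothetical sampler type: (problem, k) -> k predicted final answers.
Sampler = Callable[[str, int], List[str]]


def pass_at_k(predictions: List[str], y_gt: str) -> int:
    """Return 1 if any prediction matches the ground truth, else 0.

    Assumption: answers are compared as stripped strings.
    """
    return int(any(p.strip() == y_gt.strip() for p in predictions))


def select_problems(
    problems: List[Tuple[str, str]],
    weak_sampler: Sampler,    # stands in for Qwen3-8B, non-thinking mode
    strong_sampler: Sampler,  # stands in for Qwen3-30B-A3B, thinking mode
) -> List[Tuple[str, str]]:
    kept = []
    for x, y_gt in problems:
        # Stage-1 (lower bound): drop problems the weak model solves in 4 tries.
        if pass_at_k(weak_sampler(x, 4), y_gt) != 0:
            continue
        # Stage-2 (upper bound): keep only problems the strong model solves in 5 tries.
        if pass_at_k(strong_sampler(x, 5), y_gt) > 0:
            kept.append((x, y_gt))
    return kept


# Toy samplers backed by canned answers, for illustration only.
weak_answers = {
    "easy": ["4", "4", "4", "4"],
    "medium": ["1", "2", "3", "5"],
    "broken": ["0", "0", "0", "0"],
}
strong_answers = {
    "easy": ["4", "4", "4", "4", "4"],
    "medium": ["7", "9", "9", "8", "9"],
    "broken": ["1", "1", "1", "1", "1"],
}

problems = [("easy", "4"), ("medium", "9"), ("broken", "42")]
selected = select_problems(
    problems,
    lambda x, k: weak_answers[x][:k],
    lambda x, k: strong_answers[x][:k],
)
print(selected)
```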
---
## Distillation & Verification
### Response Synthesis
We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem.
### Response Verification
We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**.
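
The keep-or-discard step can be sketched as a filter over candidate responses. The `Verifier` callable below is a hypothetical interface mirroring the (x, y_gen, y_gt) inputs described above, and `toy_verifier` is a stand-in that only compares the text after an assumed `Answer:` marker; it is not how Compass-Verifier-7B works. The example uses three candidates rather than five to stay short.

```python
from typing import Callable, List, Tuple

# Hypothetical verifier interface: (x, y_gen, y_gt) -> True if judged correct.
Verifier = Callable[[str, str, str], bool]


def keep_verified(
    x: str, y_gt: str, candidates: List[str], verifier: Verifier
) -> List[Tuple[str, str]]:
    """Return only the (problem, response) pairs the verifier judges correct."""
    return [(x, y_gen) for y_gen in candidates if verifier(x, y_gen, y_gt)]


def toy_verifier(x: str, y_gen: str, y_gt: str) -> bool:
    """Stand-in verifier: compare the text after the last 'Answer:' marker."""
    final = y_gen.rsplit("Answer:", 1)[-1].strip()
    return final == y_gt.strip()


candidates = [
    "3 * 4 = 12. Answer: 12",
    "3 + 4 = 7. Answer: 7",
    "Four threes make twelve. Answer: 12",
]
verified = keep_verified("What is 3 * 4?", "12", candidates, toy_verifier)
print(len(verified), "of", len(candidates), "candidates kept")
```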
---
## Training Data Source Composition
ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:
| Source | Count | Percentage |
|---|---:|---:|
| ScaleQuest-Math | 87,755 | 19.09% |
| NuminaMath-CoT | 75,971 | 16.53% |
| OpenMathInstruct-2 | 65,688 | 14.29% |
| MegaScience (math) | 54,904 | 11.94% |
| OpenMathReasoning | 49,463 | 10.76% |
| AM-Thinking-Distilled | 38,375 | 8.35% |
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
| SCP-116K | 16,066 | 3.50% |
| DeepMath-309K | 11,956 | 2.60% |
| math-gpt-4o-200k | 8,355 | 1.82% |
| OpenR1-Math-220k | 7,999 | 1.74% |
| MathFusionQA | 6,510 | 1.42% |
---
## Content Characteristics
### Subject Distribution
ODA-Math-460k maintains a **more balanced** subject composition than several peers:
- Algebra remains substantial (**~44.8%**).
- Geometry accounts for roughly **20-22%**.
- Calculus, Discrete Math & Probability, and Number Theory each account for around **11%**.

This mitigates subject bias and reduces performance drops on underrepresented topics.
### Difficulty Distribution
In addition to model-based pass rates, we estimate difficulty with an LLM-as-Judge on a **1-10 scale** mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).
| Level | Equivalent Competition Tier | Description |
| :--- | :--- | :--- |
| **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
| **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
| **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
| **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
| **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
| **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
| **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
| **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
| **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
| **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |

ODA-Math-460k has a bimodal difficulty distribution that mixes fundamental and intermediate reasoning tasks:
- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.
---
## Performance
ODA-Math-460k is evaluated as an SFT corpus for **Qwen2.5-7B-Base**.
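The ODA-Math model averages **64.1** across the nine benchmarks, compared with **26.2** for the base checkpoint, and improves on every benchmark. The largest gains are on **competition-style** benchmarks, for example AIME'25 (6.7 to 56.7) and HMMT'25 (0.0 to 45.0). Scores are accuracy (%), and AVG is the unweighted mean over the nine benchmarks.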
| Dataset | Size | GSM8K | Math500 | Omni-Math | Olympiad | AIME'24 | AIME'25 | CMIMC'25 | HMMT'25 | BRUMO'25 | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Qwen2.5-7B-Base** | | | | | | | | | | | |
| Qwen2.5-7B-Base | - | 80.0 | 50.2 | 26.0 | 35.9 | 6.7 | 6.7 | 10.0 | 0.0 | 20.0 | 26.2 |
| LIMO | 817 | 92.1 | 66.8 | 21.6 | 34.9 | 4.6 | 1.7 | 0.0 | 0.0 | 5.4 | 25.2 |
| OpenMathInstruct-2 | 1M | 91.6 | 65.9 | 22.5 | 30.7 | 6.7 | 5.0 | 5.0 | 0.0 | 13.6 | 26.8 |
| MegaScience (math) | 414k | 90.1 | 77.8 | 28.7 | 44.5 | 16.7 | 15.0 | 8.1 | 0.0 | 26.7 | 34.2 |
| Fast-Math-R1-SFT | 8k | 90.6 | 80.0 | 35.8 | 50.3 | 23.3 | 26.7 | 7.5 | 8.3 | 31.7 | 39.4 |
| DeepMath-103K | 103k | 92.1 | 92.0 | 45.4 | 60.2 | 34.2 | 31.7 | 10.0 | 11.7 | 15.0 | 43.6 |
| Light-R1-SFT | 79k | 92.0 | 88.0 | 43.3 | 60.2 | 38.3 | 26.7 | 22.5 | 13.3 | 38.3 | 47.0 |
| SYNTHETIC-2 (math) | 50k | 92.1 | 90.0 | 54.5 | 67.4 | 45.0 | 35.0 | 19.7 | 20.0 | 36.7 | 51.2 |
| MiroMind-M1-SFT | 719k | 93.9 | 91.6 | 48.1 | 66.3 | 55.0 | 30.0 | 27.5 | 18.3 | 50.0 | 53.4 |
| OmniThought-0528 | 365k | 93.2 | 89.8 | 54.3 | 68.1 | 50.4 | 40.0 | 25.0 | 28.3 | 45.0 | 54.9 |
| OpenThoughts3 | 1.2M | 91.7 | 93.8 | 44.8 | 68.8 | 60.0 | 45.0 | 27.5 | 31.7 | 50.0 | 57.0 |
| AM-Thinking (math) | 558k | 92.9 | 96.2 | 60.6 | 74.2 | 63.3 | 50.0 | 27.8 | 36.7 | 63.3 | 62.8 |
| ODA-Math | 460k | 94.3 | 95.4 | 62.6 | 70.9 | 56.7 | 56.7 | 35.0 | 45.0 | 60.0 | 64.1 |
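
The AVG column appears to be the unweighted mean of the nine benchmark scores, rounded to one decimal place. The check below recomputes it for three rows copied from the table; the dictionary layout and variable names are only illustrative.

```python
# Benchmark scores copied from the table above, in column order:
# GSM8K, Math500, Omni-Math, Olympiad, AIME'24, AIME'25, CMIMC'25, HMMT'25, BRUMO'25
scores = {
    "Qwen2.5-7B-Base": [80.0, 50.2, 26.0, 35.9, 6.7, 6.7, 10.0, 0.0, 20.0],
    "AM-Thinking (math)": [92.9, 96.2, 60.6, 74.2, 63.3, 50.0, 27.8, 36.7, 63.3],
    "ODA-Math": [94.3, 95.4, 62.6, 70.9, 56.7, 56.7, 35.0, 45.0, 60.0],
}

# Unweighted mean over the nine benchmarks, rounded like the AVG column.
avg = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Absolute gain of the ODA-Math model over the base checkpoint.
gain_over_base = round(avg["ODA-Math"] - avg["Qwen2.5-7B-Base"], 1)

for name, value in avg.items():
    print(f"{name:<20} {value}")
print("Gain over base:", gain_over_base)
```

Because the mean is unweighted, small benchmarks such as AIME (30 problems per year) count as much as GSM8K, so the average is sensitive to a handful of competition problems.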