---
base_model: Qwen/Qwen2.5-7B-Base
library_name: transformers
pipeline_tag: text-generation
datasets:
- OpenDataArena/ODA-Math-460k
tags:
- qwen2.5
- sft
- opendataarena
- oda-math
- math
- reasoning
license: cc-by-nc-4.0
language:
- en
metrics:
- accuracy
---

# Qwen2.5-7B-ODA-Math-460k

[Figure: Leaderboard Performance]

Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of **Qwen2.5-7B-Base**, trained with **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**. ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**. It targets a “**learnable but challenging**” difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

---

## 🧠 Model Summary

- **Base Model**: `Qwen/Qwen2.5-7B-Base`
- **Training Data**: `OpenDataArena/ODA-Math-460k`
- **Domain Coverage**: Mathematics (strictly filtered)
- **Scale (selected training set)**: ~**460K** problems (after selection and verification pipeline)
- **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.

---

## ⚙️ Training Data Curation Pipeline

ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.

### 1️⃣ Data Collection

We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math.

### 2️⃣ Deduplication & Decontamination

We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
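
As a rough illustration of the two steps above, the sketch below removes exact duplicates and then drops questions that overlap with benchmark items. The case/whitespace normalization, the word-level n-gram overlap rule, the `n=8` window, and every function name are assumptions made for this example; the matching criteria actually used for ODA-Math are not specified in this card.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions):
    """Keep the first occurrence of each normalized question, preserving order."""
    seen, unique = set(), []
    for q in questions:
        key = normalize(q)
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def ngrams(text: str, n: int) -> set:
    """Word-level n-grams; texts shorter than n words yield an empty set."""
    tokens = normalize(text).split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(questions, benchmark_questions, n: int = 8):
    """Drop questions sharing at least one n-gram with any benchmark item."""
    bench = set()
    for b in benchmark_questions:
        bench |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & bench)]


# Hypothetical toy data, for illustration only.
pool = [
    "What is 2 + 2?",
    "what is  2 + 2?",
    "Find all primes p such that p + 2 and p + 4 are also prime numbers.",
]
bench = ["Find all primes p such that p + 2 and p + 4 are also prime numbers. Justify."]

clean = decontaminate(exact_dedup(pool), bench, n=8)
print(clean)
```

Note that with this particular rule, questions shorter than `n` words can never be flagged as contaminated, so a real pipeline would likely need a separate check for short items.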
### 3️⃣ Question Filtering (Quality & Suitability)

A multi-stage filtering pipeline refines domain specificity and usability by applying:

- an LLM-based **domain classifier**, to remove out-of-domain items such as coding/general instruction tasks;
- an LLM-based **validity validator**, to remove ill-formed questions with missing premises or undefined notation;
- **problem-type filtering** (via the *Big Math* toolkit), to exclude proof questions and guessing-prone formats like multiple-choice and true/false.

This leaves predominantly **free-form** problems with objectively verifiable answers.

### 📊 Filtration Statistics

| Pipeline Stage | Count | Percentage |
|---|---:|---:|
| Raw Collection | 11.4M | 100% |
| Dedup & Decontamination | 4.3M | 37.7% |
| Question Filtering | 3.3M | 28.9% |
| Stage-1 Filtering | 815.3K | 7.2% |
| Stage-2 Filtering | 459.6K | 4.0% |

---

## 🎯 Data Selection

Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**.

### Stage-1: Lower-Bound Filtering

Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode: for each problem we sample **k=4** responses, compute **Pass@4** by matching each predicted final answer to **y_gt**, and keep the problem **only if** **Pass@4(x) = 0** (i.e., none of four attempts is correct).

### Stage-2: Upper-Bound Filtering

Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it).

---

## ✅ Distillation & Verification

### 🧪 Response Synthesis

We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem.
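
For reference, the Stage-1 and Stage-2 rules from the Data Selection section reduce to a single per-problem predicate. The sketch below is a minimal illustration that assumes the sampled final answers are already available as strings; the exact-string matcher and the names `num_correct` and `keep_problem` are hypothetical stand-ins, not the pipeline's actual answer-matching logic.

```python
from typing import Callable, List


def exact_match(pred: str, y_gt: str) -> bool:
    """Placeholder answer matcher (assumption): exact match after stripping spaces."""
    return pred.strip() == y_gt.strip()


def num_correct(answers: List[str], y_gt: str,
                is_match: Callable[[str, str], bool] = exact_match) -> int:
    """Count how many sampled final answers match the ground truth."""
    return sum(1 for a in answers if is_match(a, y_gt))


def keep_problem(weak_answers: List[str], strong_answers: List[str], y_gt: str,
                 is_match: Callable[[str, str], bool] = exact_match) -> bool:
    """Keep a problem only if it sits in the 'learnable but challenging' band.

    Stage-1: none of the weaker model's k=4 samples is correct (Pass@4 = 0).
    Stage-2: at least one of the stronger model's k=5 samples is correct (Pass@5 > 0).
    """
    stage1_ok = num_correct(weak_answers, y_gt, is_match) == 0
    stage2_ok = num_correct(strong_answers, y_gt, is_match) > 0
    return stage1_ok and stage2_ok


# Hypothetical samples for one problem whose ground-truth answer is "12".
print(keep_problem(["3", "5", "7", "9"], ["12", "11", "12", "10", "12"], "12"))
```

A problem the weaker model solves even once is dropped as too easy, and a problem the stronger model never solves is dropped as unsolvable or ambiguous.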
### 🔍 Response Verification

We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**.

---

## 📚 Training Data Source Composition

ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:

| Source | Count | Percentage |
|---|---:|---:|
| ScaleQuest-Math | 87,755 | 19.09% |
| NuminaMath-CoT | 75,971 | 16.53% |
| OpenMathInstruct-2 | 65,688 | 14.29% |
| MegaScience (math) | 54,904 | 11.94% |
| OpenMathReasoning | 49,463 | 10.76% |
| AM-Thinking-Distilled | 38,375 | 8.35% |
| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
| SCP-116K | 16,066 | 3.50% |
| DeepMath-309K | 11,956 | 2.60% |
| math-gpt-4o-200k | 8,355 | 1.82% |
| OpenR1-Math-220k | 7,999 | 1.74% |
| MathFusionQA | 6,510 | 1.42% |

---

## 🔬 Content Characteristics

### 📘 Subject Distribution

ODA-Math-460k maintains a **more balanced** subject composition than several peers:

- Algebra remains substantial (**~44.8%**),
- Geometry accounts for roughly **20–22%**,
- Calculus, Discrete Math & Probability, and Number Theory each account for **~11%**.

This mitigates subject bias and reduces performance drops on underrepresented topics.

### 📉 Difficulty Distribution

In addition to the model-based pass rate, we also adopt LLM-as-Judge difficulty estimation on a **1-10 scale**, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).

| Level | Equivalent Competition Tier | Description |
| :--- | :--- | :--- |
| **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
| **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
| **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
| **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
| **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
| **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
| **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
| **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
| **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
| **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |

[Figure: Difficulty Distribution]

ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:

- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.

---

## 📈 Performance

ODA-Math-460k is evaluated as an SFT corpus for **Qwen2.5-7B-Base**. Results show consistent gains over base checkpoints, with particularly strong improvements on **competition-style** benchmarks.

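Performance Comparison (base model: Qwen2.5-7B-Base). Best scores in **bold**, second-best <u>underlined</u>.

| Dataset | Size | GSM8K | Math500 | Omni-Math | Olympiad | AIME'24 | AIME'25 | CMIMC'25 | HMMT'25 | BRUMO'25 | AVG |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen2.5-7B-Base | - | 80.0 | 50.2 | 26.0 | 35.9 | 6.7 | 6.7 | 10.0 | 0.0 | 20.0 | 26.2 |
| LIMO | 817 | 92.1 | 66.8 | 21.6 | 34.9 | 4.6 | 1.7 | 0.0 | 0.0 | 5.4 | 25.2 |
| OpenMathInstruct-2 | 1M | 91.6 | 65.9 | 22.5 | 30.7 | 6.7 | 5.0 | 5.0 | 0.0 | 13.6 | 26.8 |
| MegaScience (math) | 414k | 90.1 | 77.8 | 28.7 | 44.5 | 16.7 | 15.0 | 8.1 | 0.0 | 26.7 | 34.2 |
| Fast-Math-R1-SFT | 8k | 90.6 | 80.0 | 35.8 | 50.3 | 23.3 | 26.7 | 7.5 | 8.3 | 31.7 | 39.4 |
| DeepMath-103K | 103k | 92.1 | 92.0 | 45.4 | 60.2 | 34.2 | 31.7 | 10.0 | 11.7 | 15.0 | 43.6 |
| Light-R1-SFT | 79k | 92.0 | 88.0 | 43.3 | 60.2 | 38.3 | 26.7 | 22.5 | 13.3 | 38.3 | 47.0 |
| SYNTHETIC-2 (math) | 50k | 92.1 | 90.0 | 54.5 | 67.4 | 45.0 | 35.0 | 19.7 | 20.0 | 36.7 | 51.2 |
| MiroMind-M1-SFT | 719k | <u>93.9</u> | 91.6 | 48.1 | 66.3 | 55.0 | 30.0 | 27.5 | 18.3 | 50.0 | 53.4 |
| OmniThought-0528 | 365k | 93.2 | 89.8 | 54.3 | 68.1 | 50.4 | 40.0 | 25.0 | 28.3 | 45.0 | 54.9 |
| OpenThoughts3 | 1.2M | 91.7 | 93.8 | 44.8 | 68.8 | <u>60.0</u> | 45.0 | 27.5 | 31.7 | 50.0 | 57.0 |
| AM-Thinking (math) | 558k | 92.9 | **96.2** | <u>60.6</u> | **74.2** | **63.3** | <u>50.0</u> | <u>27.8</u> | <u>36.7</u> | **63.3** | <u>62.8</u> |
| ODA-Math | 460k | **94.3** | <u>95.4</u> | **62.6** | <u>70.9</u> | <u>56.7</u> | **56.7** | **35.0** | **45.0** | <u>60.0</u> | **64.1** |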
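
The AVG column is consistent with an unweighted mean of the nine benchmark scores, rounded to one decimal. The short check below recomputes it for three rows copied from the table; reading AVG as a plain arithmetic mean is an inference from the numbers rather than a documented definition, and the helper name `mean_score` is hypothetical.

```python
# Benchmark order: GSM8K, Math500, Omni-Math, Olympiad, AIME'24, AIME'25,
# CMIMC'25, HMMT'25, BRUMO'25. Values are copied from the table above.
rows = {
    "Qwen2.5-7B-Base": ([80.0, 50.2, 26.0, 35.9, 6.7, 6.7, 10.0, 0.0, 20.0], 26.2),
    "AM-Thinking (math)": ([92.9, 96.2, 60.6, 74.2, 63.3, 50.0, 27.8, 36.7, 63.3], 62.8),
    "ODA-Math": ([94.3, 95.4, 62.6, 70.9, 56.7, 56.7, 35.0, 45.0, 60.0], 64.1),
}


def mean_score(values):
    """Unweighted mean of the benchmark scores, rounded to one decimal."""
    return round(sum(values) / len(values), 1)


for name, (scores, reported_avg) in rows.items():
    recomputed = mean_score(scores)
    status = "matches" if recomputed == reported_avg else "differs from"
    print(f"{name}: recomputed {recomputed} {status} reported {reported_avg}")
```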

---

## 🌐 About OpenDataArena

[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

**Key Features:**

- 🏆 **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
- 📊 **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, etc.
- 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.

If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.

---

## 🚀 Usage

Model repo: `OpenDataArena/Qwen2.5-7B-ODA-Math-460k`. Below is a minimal runnable example for loading and inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"

# Load the tokenizer and model; device_map="auto" places weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]

# Render the conversation with the chat template, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response with nucleus sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📚 Citation

```bibtex
@article{gao2025closing,
  title={Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets},
  author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Cai, Mengzhang and He, Conghui and Wu, Lijun},
  journal={arXiv preprint arXiv:2601.09733},
  year={2025}
}
```

```bibtex
@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}
```