| base_model: Qwen/Qwen2.5-7B-Base | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| datasets: | |
| - OpenDataArena/ODA-Math-460k | |
| tags: | |
| - qwen2.5 | |
| - sft | |
| - opendataarena | |
| - oda-math | |
| - math | |
| - reasoning | |
| license: cc-by-nc-4.0 | |
| language: | |
| - en | |
| metrics: | |
| - accuracy | |
| # Qwen2.5-7B-ODA-Math-460k | |
| <img src="performance.png" alt="Leaderboard Performance" width="1200" /> | |
| Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of **Qwen2.5-7B-Base**, trained with **[ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k)**. | |
| ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the *[OpenDataArena](https://opendataarena.github.io)* leaderboard) and refined through **deduplication**, **benchmark decontamination**, **LLM-based filtering**, and **verifier-backed response distillation**. | |
| It targets a "**learnable but challenging**" difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models. | |
| --- | |
| ## Model Summary | |
| - **Base Model**: `Qwen/Qwen2.5-7B-Base` | |
| - **Training Data**: `OpenDataArena/ODA-Math-460k` | |
| - **Domain Coverage**: Mathematics (strictly filtered) | |
| - **Scale (selected training set)**: ~**460K** problems (after selection and verification pipeline) | |
| - **Goal**: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions. | |
| --- | |
| ## Training Data Curation Pipeline | |
| ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected. | |
| ### 1. Data Collection | |
| We prioritize source datasets based on their empirical impact on downstream model performance. Using the *OpenDataArena* leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the **Qwen** and **Llama** model families. These sources form the initial pool for ODA-Math. | |
| ### 2. Deduplication & Decontamination | |
| We first perform **exact deduplication** over all questions to remove identical items, and then run **benchmark decontamination** to reduce evaluation leakage by removing overlaps with standard and competition benchmarks. | |
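The two steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the normalization rule, the word-level n-gram overlap test, and the helper names (`normalize`, `exact_dedup`, `decontaminate`) are all assumptions made for the example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each question, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for q in questions:
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams of the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(questions: list[str], benchmark_questions: list[str], n: int = 8) -> list[str]:
    """Drop questions that share at least one word n-gram with any benchmark item."""
    bench: set[tuple[str, ...]] = set()
    for b in benchmark_questions:
        bench |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & bench)]


pool = [
    "What is 2 + 2?",
    "what is  2 + 2?",  # duplicate after normalization
    "Find the sum of the first ten positive odd integers.",
    "Compute the derivative of x^3.",
]
benchmark = ["Find the sum of the first ten positive odd integers."]
clean = decontaminate(exact_dedup(pool), benchmark, n=5)
print(clean)
```

Hashing the normalized text keeps memory use flat on a multi-million-item pool; the n-gram size trades recall against false positives and would need tuning on real benchmarks.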
| ### 3. Question Filtering (Quality & Suitability) | |
| A multi-stage filtering pipeline refines domain specificity and usability in three steps: an LLM-based **domain classifier** removes out-of-domain items such as coding and general instruction tasks; an LLM-based **validity validator** removes ill-formed questions with missing premises or undefined notation; and **problem-type filtering** (via the *Big Math* toolkit) excludes proof questions and guessing-prone formats such as multiple-choice and true/false. What remains is predominantly **free-form** problems with objectively verifiable answers. | |
| ### Filtration Statistics | |
| | Pipeline Stage | Count | Percentage | | |
| |---|---:|---:| | |
| | Raw Collection | 11.4M | 100% | | |
| | Dedup & Decontamination | 4.3M | 37.7% | | |
| | Question Filtering | 3.3M | 28.9% | | |
| | Stage-1 Filtering | 815.3K | 7.2% | | |
| | Stage-2 Filtering | 459.6K | 4.0% | | |
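The percentages in the table are shares of the raw collection. As a quick check, the sketch below recomputes them from the rounded counts and also derives the retention of each stage relative to the one before it, which the table does not list. The `retention` helper is a hypothetical name, and because the counts are rounded the results are approximate.

```python
# Rounded counts copied from the Filtration Statistics table.
STAGES = [
    ("Raw Collection", 11_400_000),
    ("Dedup & Decontamination", 4_300_000),
    ("Question Filtering", 3_300_000),
    ("Stage-1 Filtering", 815_300),
    ("Stage-2 Filtering", 459_600),
]


def retention(stages):
    """Return (name, share_of_raw, share_of_previous_stage) for each stage."""
    raw = stages[0][1]
    rows = []
    prev = raw
    for name, count in stages:
        rows.append((name, count / raw, count / prev))
        prev = count
    return rows


for name, of_raw, of_prev in retention(STAGES):
    print(f"{name:<24} {of_raw:6.1%} of raw   {of_prev:6.1%} of previous stage")
```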
| --- | |
| ## Data Selection | |
| Given the large curated pool, ODA-Math-460k retains problems that are **hard for small models** but **solvable for stronger reasoning models**. | |
| ### Stage-1: Lower-Bound Filtering | |
| Stage-1 removes trivial problems using **Qwen3-8B** in *non-thinking* mode. For each problem **x** we sample **k=4** responses, compute **Pass@4** by matching each predicted final answer against the ground-truth answer **y_gt**, and keep the problem **only if** **Pass@4(x) = 0** (i.e., none of the four attempts is correct). | |
| ### Stage-2: Upper-Bound Filtering | |
| Stage-2 removes unsolvable or ambiguous problems using **Qwen3-30B-A3B** in *thinking* mode: we generate **k=5** reasoning traces per problem, compute **Pass@5**, and keep the problem **only if** **Pass@5(x) > 0** (i.e., at least one attempt solves it). | |
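The two stages together define a difficulty band, which can be sketched as below. The samplers are hypothetical callables standing in for the weak model (Qwen3-8B, non-thinking) and the strong model (Qwen3-30B-A3B, thinking); exact string comparison is an assumption, since the card does not say how final answers are matched.

```python
from typing import Callable, Sequence

# Hypothetical interface: (problem, k) -> k predicted final answers.
Sampler = Callable[[str, int], Sequence[str]]


def pass_at_k(answers: Sequence[str], y_gt: str) -> int:
    """Return 1 if any sampled final answer equals the ground truth, else 0."""
    return int(any(a.strip() == y_gt.strip() for a in answers))


def select_problems(
    problems: Sequence[tuple[str, str]],
    weak_sampler: Sampler,
    strong_sampler: Sampler,
    k_weak: int = 4,
    k_strong: int = 5,
) -> list[tuple[str, str]]:
    """Keep (x, y_gt) pairs the weak model never solves and the strong model solves at least once."""
    kept = []
    for x, y_gt in problems:
        # Stage-1 (lower bound): drop the problem if Pass@4 > 0 for the weak model.
        if pass_at_k(weak_sampler(x, k_weak), y_gt) != 0:
            continue
        # Stage-2 (upper bound): drop the problem if Pass@5 = 0 for the strong model.
        if pass_at_k(strong_sampler(x, k_strong), y_gt) == 0:
            continue
        kept.append((x, y_gt))
    return kept
```

Running Stage-1 first means the more expensive thinking-mode model only sees problems that survived the cheaper filter, which matches the order of the counts in the filtration table.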
| --- | |
| ## Distillation & Verification | |
| ### Response Synthesis | |
| We distill solutions using **AM-Thinking-v1** as the teacher, generating **k=5** candidate reasoning traces (step-by-step solution + final answer) for each selected problem. | |
| ### Response Verification | |
| We verify generated responses with **Compass-Verifier-7B**, which takes (problem **x**, generated response **y_gen**, ground-truth answer **y_gt**) and outputs a binary correctness decision (**correct** / **incorrect**). We keep only the (problem, response) pairs judged **correct** and discard the rest, so the released dataset contains **verified solutions only**. | |
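The filtering rule above reduces to keeping the pairs a verifier accepts. In this sketch the `Verifier` type and `toy_verifier` are hypothetical stand-ins for illustration; they do not reproduce how Compass-Verifier-7B is actually prompted or called.

```python
from typing import Callable, Iterable

# Hypothetical interface: (x, y_gen, y_gt) -> True if the response is judged correct.
Verifier = Callable[[str, str, str], bool]


def keep_verified(
    candidates: Iterable[tuple[str, str, str]],
    verifier: Verifier,
) -> list[tuple[str, str]]:
    """Return the (problem, response) pairs judged correct; discard the rest."""
    return [(x, y_gen) for x, y_gen, y_gt in candidates if verifier(x, y_gen, y_gt)]


def toy_verifier(x: str, y_gen: str, y_gt: str) -> bool:
    """Toy stand-in: accept a response whose text ends with the ground-truth answer."""
    return y_gen.strip().endswith(y_gt.strip())


candidates = [
    ("What is 1 + 1?", "Adding the two ones gives 2", "2"),
    ("What is 1 + 1?", "Adding the two ones gives 3", "2"),
]
print(keep_verified(candidates, toy_verifier))
```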
| --- | |
| ## Training Data Source Composition | |
| ODA-Math-460k mixes multiple high-quality math datasets so that no single style or annotation protocol dominates. Top contributors: | |
| | Source | Count | Percentage | | |
| |---|---:|---:| | |
| | ScaleQuest-Math | 87,755 | 19.09% | | |
| | NuminaMath-CoT | 75,971 | 16.53% | | |
| | OpenMathInstruct-2 | 65,688 | 14.29% | | |
| | MegaScience (math) | 54,904 | 11.94% | | |
| | OpenMathReasoning | 49,463 | 10.76% | | |
| | AM-Thinking-Distilled | 38,375 | 8.35% | | |
| | MiroMind-M1-SFT-719K | 23,417 | 5.09% | | |
| | SCP-116K | 16,066 | 3.50% | | |
| | DeepMath-309K | 11,956 | 2.60% | | |
| | math-gpt-4o-200k | 8,355 | 1.82% | | |
| | OpenR1-Math-220k | 7,999 | 1.74% | | |
| | MathFusionQA | 6,510 | 1.42% | | |
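Since the table lists top contributors only, it can help to see how much of the dataset they cover. The sketch below assumes the denominator is the rounded Stage-2 count of 459.6K from the filtration statistics, so the figures are approximate.

```python
# Per-source counts copied from the table above.
SOURCE_COUNTS = {
    "ScaleQuest-Math": 87_755,
    "NuminaMath-CoT": 75_971,
    "OpenMathInstruct-2": 65_688,
    "MegaScience (math)": 54_904,
    "OpenMathReasoning": 49_463,
    "AM-Thinking-Distilled": 38_375,
    "MiroMind-M1-SFT-719K": 23_417,
    "SCP-116K": 16_066,
    "DeepMath-309K": 11_956,
    "math-gpt-4o-200k": 8_355,
    "OpenR1-Math-220k": 7_999,
    "MathFusionQA": 6_510,
}

# Assumption: total taken from the rounded Stage-2 count (459.6K).
TOTAL = 459_600

listed = sum(SOURCE_COUNTS.values())
coverage = listed / TOTAL
print(f"Listed sources: {listed:,} problems ({coverage:.1%} of the dataset)")
print(f"Other sources:  about {TOTAL - listed:,} problems")
```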
| --- | |
| ## Content Characteristics | |
| ### Subject Distribution | |
| <img src="math_oda_subject_distribution_pie.png" alt="Subject Distribution" width="600" /> | |
| ODA-Math-460k maintains a **more balanced** subject composition than several peers: | |
| - Algebra remains substantial (**~44.8%**), | |
| - Geometry accounts for roughly **20–22%**, | |
| - Calculus, Discrete Math & Probability, and Number Theory each make up around **11%**. | |
| This balance is intended to mitigate subject bias and reduce performance drops on underrepresented topics. | |
| ### Difficulty Distribution | |
| In addition to model-based pass rates, we estimate difficulty with an LLM-as-Judge on a **1-10 scale**, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings). | |
| | Level | Equivalent Competition Tier | Description | | |
| | :--- | :--- | :--- | | |
| | **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. | | |
| | **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. | | |
| | **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. | | |
| | **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. | | |
| | **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. | | |
| | **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. | | |
| | **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. | | |
| | **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. | | |
| | **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. | | |
| | **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. | | |
| <img src="math_oda_difficulty_distribution.png" alt="Difficulty Distribution" width="600" /> | |
| ODA-Math-460k has a bimodal difficulty profile that combines fundamental and intermediate reasoning tasks: | |
| - Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts. | |
| - Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges. | |
| - Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries. | |
| --- | |
| ## Performance | |
| ODA-Math-460k is evaluated as an SFT corpus for **Qwen2.5-7B-Base**. | |
| Results show gains over the base checkpoint on all nine benchmarks (average 26.2 → 64.1), with particularly strong improvements on **competition-style** benchmarks such as AIME'25 (6.7 → 56.7) and HMMT'25 (0.0 → 45.0). | |
| <div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;"> | |
| <table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;"> | |
| <caption style="padding: 10px; font-weight: bold;">Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>.</caption> | |
| <thead> | |
| <tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;"> | |
| <th style="text-align: left; padding: 8px;">Dataset</th> | |
| <th>Size</th> | |
| <th>GSM8K</th> | |
| <th>Math500</th> | |
| <th>Omni-Math</th> | |
| <th>Olympiad</th> | |
| <th>AIME'24</th> | |
| <th>AIME'25</th> | |
| <th>CMIMC'25</th> | |
| <th>HMMT'25</th> | |
| <th>BRUMO'25</th> | |
| <th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th> | |
| </tr> | |
| </thead> | |
| <tbody> | |
| <tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;"> | |
| <td colspan="12" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen2.5-7B-Base</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;">Qwen2.5-7B-Base</td> | |
| <td>-</td><td>80.0</td><td>50.2</td><td>26.0</td><td>35.9</td><td>6.7</td><td>6.7</td><td>10.0</td><td>0.0</td><td>20.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.2</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/GAIR/LIMO">LIMO</a></td> | |
| <td>817</td><td>92.1</td><td>66.8</td><td>21.6</td><td>34.9</td><td>4.6</td><td>1.7</td><td>0.0</td><td>0.0</td><td>5.4</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">25.2</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/nvidia/OpenMathInstruct-2">OpenMathInstruct-2</a></td> | |
| <td>1M</td><td>91.6</td><td>65.9</td><td>22.5</td><td>30.7</td><td>6.7</td><td>5.0</td><td>5.0</td><td>0.0</td><td>13.6</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.8</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience (math)</a></td> | |
| <td>414k</td><td>90.1</td><td>77.8</td><td>28.7</td><td>44.5</td><td>16.7</td><td>15.0</td><td>8.1</td><td>0.0</td><td>26.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">34.2</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT">Fast-Math-R1-SFT</a></td> | |
| <td>8k</td><td>90.6</td><td>80.0</td><td>35.8</td><td>50.3</td><td>23.3</td><td>26.7</td><td>7.5</td><td>8.3</td><td>31.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">39.4</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/zwhe99/DeepMath-103K">DeepMath-103K</a></td> | |
| <td>103k</td><td>92.1</td><td>92.0</td><td>45.4</td><td>60.2</td><td>34.2</td><td>31.7</td><td>10.0</td><td>11.7</td><td>15.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">43.6</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/qihoo360/Light-R1-SFTData">Light-R1-SFT</a></td> | |
| <td>79k</td><td>92.0</td><td>88.0</td><td>43.3</td><td>60.2</td><td>38.3</td><td>26.7</td><td>22.5</td><td>13.3</td><td>38.3</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">47.0</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified">SYNTHETIC-2 (math)</a></td> | |
| <td>50k</td><td>92.1</td><td>90.0</td><td>54.5</td><td>67.4</td><td>45.0</td><td>35.0</td><td>19.7</td><td>20.0</td><td>36.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">51.2</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K">MiroMind-M1-SFT</a></td> | |
| <td>719k</td><td><u>93.9</u></td><td>91.6</td><td>48.1</td><td>66.3</td><td>55.0</td><td>30.0</td><td>27.5</td><td>18.3</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.4</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/alibaba-pai/OmniThought-0528">OmniThought-0528</a></td> | |
| <td>365k</td><td>93.2</td><td>89.8</td><td>54.3</td><td>68.1</td><td>50.4</td><td>40.0</td><td>25.0</td><td>28.3</td><td>45.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.9</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M">OpenThoughts3</a></td> | |
| <td>1.2M</td><td>91.7</td><td>93.8</td><td>44.8</td><td>68.8</td><td><u>60.0</u></td><td>45.0</td><td>27.5</td><td>31.7</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">57.0</td> | |
| </tr> | |
| <tr> | |
| <td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled">AM-Thinking (math)</a></td> | |
| <td>558k</td><td>92.9</td><td><b>96.2</b></td><td><u>60.6</u></td><td><b>74.2</b></td><td><b>63.3</b></td><td><u>50.0</u></td><td><u>27.8</u></td><td><u>36.7</u></td><td><b>63.3</b></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>62.8</u></td> | |
| </tr> | |
| <tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;"> | |
| <td style="text-align: left; padding: 8px;">ODA-Math</td> | |
| <td>460k</td><td><b>94.3</b></td><td><u>95.4</u></td><td><b>62.6</b></td><td><u>70.9</u></td><td>56.7</td><td><b>56.7</b></td><td><b>35.0</b></td><td><b>45.0</b></td><td><u>60.0</u></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>64.1</b></td> | |
| </tr> | |
| </tbody> | |
| </table> | |
| </div> | |
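The AVG column is consistent with an unweighted mean over the nine benchmarks. The card does not state this explicitly, so treat it as an inference from the numbers. The sketch below recomputes it for the base model and the ODA-Math row and lists the per-benchmark gains.

```python
from statistics import mean

BENCHMARKS = [
    "GSM8K", "Math500", "Omni-Math", "Olympiad", "AIME'24",
    "AIME'25", "CMIMC'25", "HMMT'25", "BRUMO'25",
]

# Scores copied from the Qwen2.5-7B-Base and ODA-Math rows of the table.
base = [80.0, 50.2, 26.0, 35.9, 6.7, 6.7, 10.0, 0.0, 20.0]
oda = [94.3, 95.4, 62.6, 70.9, 56.7, 56.7, 35.0, 45.0, 60.0]

# Assumption: AVG is the plain (unweighted) mean of the nine scores.
avg_base = round(mean(base), 1)
avg_oda = round(mean(oda), 1)

# Absolute gain of the fine-tuned model over the base model per benchmark.
gains = {name: round(o - b, 1) for name, b, o in zip(BENCHMARKS, base, oda)}

print(f"AVG base = {avg_base}, AVG ODA-Math = {avg_oda}")
for name, gain in gains.items():
    print(f"{name:<10} +{gain}")
```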
| --- | |
| ## About OpenDataArena | |
| [OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing. | |
| **Key Features:** | |
| - **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**. | |
| - **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, and more. | |
| - **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring. | |
| If you find our work helpful, please consider **starring and subscribing** to support our research. | |
| --- | |
| ## Usage | |
| Model repo: `OpenDataArena/Qwen2.5-7B-ODA-Math-460k`. Below is a minimal runnable example for loading and inference: | |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"

# Load the tokenizer and model; device_map="auto" places weights on available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
]

# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
| --- | |
| ## Citation | |
```bibtex
@article{gao2025closing,
  title={Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets},
  author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Cai, Mengzhang and He, Conghui and Wu, Lijun},
  journal={arXiv preprint arXiv:2601.09733},
  year={2025}
}
```
```bibtex
@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}
```