Instructions to use OpenDataArena/Qwen2.5-7B-ODA-Math-460k with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use OpenDataArena/Qwen2.5-7B-ODA-Math-460k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="OpenDataArena/Qwen2.5-7B-ODA-Math-460k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("OpenDataArena/Qwen2.5-7B-ODA-Math-460k")
model = AutoModelForCausalLM.from_pretrained("OpenDataArena/Qwen2.5-7B-ODA-Math-460k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use OpenDataArena/Qwen2.5-7B-ODA-Math-460k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenDataArena/Qwen2.5-7B-ODA-Math-460k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
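
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library and assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are illustrative and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the prompt to a running server and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the reply as a string. Passing `base_url="http://localhost:30000"` targets a server on port 30000 instead.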

Use Docker

docker model run hf.co/OpenDataArena/Qwen2.5-7B-ODA-Math-460k

SGLang

How to use OpenDataArena/Qwen2.5-7B-ODA-Math-460k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "OpenDataArena/Qwen2.5-7B-ODA-Math-460k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenDataArena/Qwen2.5-7B-ODA-Math-460k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenDataArena/Qwen2.5-7B-ODA-Math-460k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenDataArena/Qwen2.5-7B-ODA-Math-460k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use OpenDataArena/Qwen2.5-7B-ODA-Math-460k with Docker Model Runner:
```
docker model run hf.co/OpenDataArena/Qwen2.5-7B-ODA-Math-460k
```

Qwen2.5-7B-ODA-Math-460k / README.md

GX-XinGao

Update README.md

0831cc3 verified 5 months ago


	---
	base_model: Qwen/Qwen2.5-7B-Base
	library_name: transformers
	pipeline_tag: text-generation
	datasets:
	- OpenDataArena/ODA-Math-460k
	tags:
	- qwen2.5
	- sft
	- opendataarena
	- oda-math
	- math
	- reasoning
	license: cc-by-nc-4.0
	language:
	- en
	metrics:
	- accuracy
	---

	# Qwen2.5-7B-ODA-Math-460k
	<img src="performance.png" alt="Leaderboard Performance" width="1200" />

	Qwen2.5-7B-ODA-Math-460k is a supervised fine-tuned (SFT) model built on top of Qwen2.5-7B-Base, trained with [ODA-Math-460k](https://huggingface.co/datasets/OpenDataArena/ODA-Math-460k).

	ODA-Math-460k is a large-scale math reasoning dataset curated from top-performing open mathematics corpora (selected via the [OpenDataArena](https://opendataarena.github.io) leaderboard) and refined through deduplication, benchmark decontamination, LLM-based filtering, and verifier-backed response distillation.
	It targets a “learnable but challenging” difficulty band: non-trivial for smaller models yet solvable by stronger reasoning models.

	---

	## 🧠 Model Summary

	- Base Model: `Qwen/Qwen2.5-7B-Base`
	- Training Data: `OpenDataArena/ODA-Math-460k`
	- Domain Coverage: Mathematics (strictly filtered)
	- Scale (selected training set): ~460K problems (after selection and verification pipeline)
	- Goal: Efficiently improve mathematical reasoning and competition-style problem solving via high-quality, validated solutions.

	---

	## ⚙️ Training Data Curation Pipeline

	ODA-Math-460k is constructed from an aggregated question pool and then progressively filtered and selected.

	### 1️⃣ Data Collection

	We prioritize source datasets based on their empirical impact on downstream model performance. Using the OpenDataArena leaderboard, we aggregate top-ranking math datasets that show strong efficacy for the Qwen and Llama model families. These sources form the initial pool for ODA-Math.

	### 2️⃣ Deduplication & Decontamination

	We first perform exact deduplication over all questions to remove identical items, and then run benchmark decontamination to reduce evaluation leakage by removing overlaps with standard and competition benchmarks.
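
As an illustration of these two steps, here is a minimal sketch. It is not the authors' implementation: the whitespace and case normalization before hashing, the word n-gram overlap test, and the value of `n` are all assumptions made for the example, since the card does not specify how matches are detected.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization, not from the card)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(questions):
    """Keep the first occurrence of each question, compared by hash of the normalized text."""
    seen, unique = set(), []
    for q in questions:
        digest = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(q)
    return unique


def ngrams(text, n):
    """Return the set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(questions, benchmark_questions, n=8):
    """Drop any question that shares at least one word n-gram with a benchmark item."""
    benchmark_grams = set()
    for b in benchmark_questions:
        benchmark_grams |= ngrams(b, n)
    return [q for q in questions if not (ngrams(q, n) & benchmark_grams)]
```

A smaller `n` removes more aggressively and risks discarding unrelated questions that share common phrasing. A larger `n` is more permissive and never matches questions shorter than `n` words.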

	### 3️⃣ Question Filtering (Quality & Suitability)

	A multi-stage filtering pipeline refines domain specificity and usability in three steps:

	- An LLM-based domain classifier removes out-of-domain items such as coding and general instruction tasks.
	- An LLM-based validity validator removes ill-formed questions with missing premises or undefined notation.
	- Problem-type filtering (via the Big Math toolkit) excludes proof questions and guessing-prone formats such as multiple-choice and true/false.

	What remains is predominantly free-form problems with objectively verifiable answers.

	### 📊 Filtration Statistics

	| Pipeline Stage | Count | Percentage |
	|---|---:|---:|
	| Raw Collection | 11.4M | 100% |
	| Dedup & Decontamination | 4.3M | 37.7% |
	| Question Filtering | 3.3M | 28.9% |
	| Stage-1 Filtering | 815.3K | 7.2% |
	| Stage-2 Filtering | 459.6K | 4.0% |
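
The percentages in the table are each stage's share of the raw pool. The short sketch below recomputes them and also derives the stage-to-stage retention, which the table does not list. The counts are the rounded figures from the table, so the derived values are approximate.

```python
# Stage counts copied from the table above (questions remaining after each stage).
STAGE_COUNTS = [
    ("Raw Collection", 11_400_000),
    ("Dedup & Decontamination", 4_300_000),
    ("Question Filtering", 3_300_000),
    ("Stage-1 Filtering", 815_300),
    ("Stage-2 Filtering", 459_600),
]


def retention_table(stage_counts):
    """Return (stage, % of raw pool, % of previous stage) rows."""
    raw = stage_counts[0][1]
    rows, previous = [], raw
    for name, count in stage_counts:
        rows.append((name, round(100 * count / raw, 1), round(100 * count / previous, 1)))
        previous = count
    return rows


for name, of_raw, of_previous in retention_table(STAGE_COUNTS):
    print(f"{name:<26} {of_raw:>6.1f}% of raw   {of_previous:>6.1f}% of previous stage")
```

Read stage to stage, Stage-1 keeps about 24.7% of the filtered questions and Stage-2 keeps about 56.4% of the Stage-1 survivors.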

	---

	## 🎯 Data Selection

	Given the large curated pool, ODA-Math-460k retains problems that are hard for small models but solvable for stronger reasoning models.

	### Stage-1: Lower-Bound Filtering

	Stage-1 removes trivial problems using Qwen3-8B in non-thinking mode: for each problem we sample k=4 responses, compute Pass@4 by matching each predicted final answer to y_gt, and keep the problem only if Pass@4(x) = 0 (i.e., none of four attempts is correct).

	### Stage-2: Upper-Bound Filtering

	Stage-2 removes unsolvable or ambiguous problems using Qwen3-30B-A3B in thinking mode: we generate k=5 reasoning traces per problem, compute Pass@5, and keep the problem only if Pass@5(x) > 0 (i.e., at least one attempt solves it).
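
The two stages together define a difficulty band, which can be sketched as below. This is an illustration under stated assumptions rather than the actual pipeline: answers are compared by plain string equality as a stand-in for whatever answer matching the authors use, and the sampled answers are assumed to be precomputed and passed in as dictionaries.

```python
def pass_at_k(predictions, ground_truth):
    """Return 1 if any of the k predicted final answers matches the ground truth, else 0."""
    return int(any(p.strip() == ground_truth.strip() for p in predictions))


def select_problems(problems, weak_answers, strong_answers):
    """Keep problems the weaker model never solves but the stronger model solves at least once.

    problems:       list of (problem_id, ground_truth_answer)
    weak_answers:   problem_id -> k=4 final answers from the weaker model (Stage-1)
    strong_answers: problem_id -> k=5 final answers from the stronger model (Stage-2)
    """
    kept = []
    for pid, y_gt in problems:
        if pass_at_k(weak_answers[pid], y_gt) != 0:
            continue  # Stage-1: too easy, the weaker model already solves it
        if pass_at_k(strong_answers[pid], y_gt) == 0:
            continue  # Stage-2: never solved, possibly unsolvable or ambiguous
        kept.append(pid)
    return kept
```

In a real run the Stage-2 model would only need to be sampled on problems that survive Stage-1, which is why the Stage-1 check comes first.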

	---

	## ✅ Distillation & Verification

	### 🧪 Response Synthesis

	We distill solutions using AM-Thinking-v1 as the teacher, generating k=5 candidate reasoning traces (step-by-step solution + final answer) for each selected problem.

	### 🔍 Response Verification

	We verify generated responses with Compass-Verifier-7B, which takes (problem x, generated response y_gen, ground-truth answer y_gt) and outputs a binary correctness decision (correct / incorrect). We keep only the (problem, response) pairs judged correct, and discard the rest—so the released dataset contains verified solutions only.
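
The filtering step can be sketched as follows. The verifier is passed in as a callable with the (problem, response, ground truth) interface described above. `toy_verifier` is a hypothetical stand-in used only to make the example runnable and does not reflect how Compass-Verifier-7B judges responses.

```python
from typing import Callable, Iterable, List, Tuple

# (problem x, generated response y_gen, ground-truth answer y_gt)
Sample = Tuple[str, str, str]


def keep_verified(
    samples: Iterable[Sample],
    verifier: Callable[[str, str, str], bool],
) -> List[Tuple[str, str]]:
    """Return only the (problem, response) pairs that the verifier judges correct."""
    return [(x, y_gen) for x, y_gen, y_gt in samples if verifier(x, y_gen, y_gt)]


def toy_verifier(problem: str, response: str, answer: str) -> bool:
    """Hypothetical stand-in: accept a response whose last line is 'Answer: <answer>'."""
    lines = response.strip().splitlines()
    return bool(lines) and lines[-1].strip() == f"Answer: {answer}"
```

Because several candidate traces are generated per problem, a problem can contribute more than one verified pair, or none at all if every candidate is rejected.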

	---

	## 📚 Training Data Source Composition

	ODA-Math-460k is a mixture of multiple high-quality math datasets to avoid domination by a single style/annotation protocol. Top contributors:

	| Source | Count | Percentage |
	|---|---:|---:|
	| ScaleQuest-Math | 87,755 | 19.09% |
	| NuminaMath-CoT | 75,971 | 16.53% |
	| OpenMathInstruct-2 | 65,688 | 14.29% |
	| MegaScience (math) | 54,904 | 11.94% |
	| OpenMathReasoning | 49,463 | 10.76% |
	| AM-Thinking-Distilled | 38,375 | 8.35% |
	| MiroMind-M1-SFT-719K | 23,417 | 5.09% |
	| SCP-116K | 16,066 | 3.50% |
	| DeepMath-309K | 11,956 | 2.60% |
	| math-gpt-4o-200k | 8,355 | 1.82% |
	| OpenR1-Math-220k | 7,999 | 1.74% |
	| MathFusionQA | 6,510 | 1.42% |

	---

	## 🔬 Content Characteristics

	### 📘 Subject Distribution

	<img src="math_oda_subject_distribution_pie.png" alt="Subject Distribution" width="600" />

	ODA-Math-460k maintains a more balanced subject composition than several peers:
	- Algebra remains the largest share (~44.8%),
	- Geometry accounts for roughly 20–22%,
	- Calculus, Discrete Math & Probability, and Number Theory each account for about 11%.

	This mitigates subject bias and reduces performance drops on underrepresented topics.

	### 📉 Difficulty Distribution

	Apart from model-based pass rate, we also adopt LLM-as-Judge difficulty estimation on a 1-10 scale, mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings).

	| Level | Equivalent Competition Tier | Description |
	| :--- | :--- | :--- |
	| 1 | Elementary / Middle School | MOEMS, AMC 8 (Early Qs). Standard word problems. |
	| 2 | Junior High | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
	| 3 | High School Beginner | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
	| 4 | High School Intermediate | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
	| 5 | Advanced High School | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
	| 6 | Pre-Olympiad | AIME (Hard), USAJMO. Introductory Olympiad level. |
	| 7 | Olympiad (Entry) | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
	| 8 | Olympiad (Medium) | IMO (Medium/Hard). High-level competition problems. |
	| 9 | Olympiad (Expert) | IMO (Hard). Expert-level constructions/proofs. |
	| 10 | Historically Hard | Outliers. Exceedingly tedious or difficult even for Olympians. |

	<img src="math_oda_difficulty_distribution.png" alt="Difficulty Distribution" width="600" />

	ODA-Math-460k features a balanced mix of fundamental and intermediate reasoning tasks:

	- Primary Mode: Difficulty 1 (~110k samples), providing a dense foundation of basic mathematical concepts.
	- Secondary Mode: Difficulty 6 (~72k samples), offering a significant concentration of intermediate-level challenges.
	- Tail: A steady decline toward Difficulty 10, maintaining a specialized set of high-complexity queries.

	---

	## 📈 Performance

	ODA-Math-460k is evaluated as an SFT corpus for Qwen2.5-7B-Base.

	Results show consistent gains over base checkpoints, with particularly strong improvements on competition-style benchmarks.

	<div style="overflow-x: auto; font-family: sans-serif; margin-bottom: 20px;">
	<table style="width: 100%; border-collapse: collapse; text-align: center; font-size: 14px; min-width: 900px; color: inherit;">
	<caption style="padding: 10px; font-weight: bold;">Performance Comparison. Best scores in <b>bold</b>, second-best <u>underlined</u>.</caption>
	<thead>
	<tr style="border-top: 2px solid currentColor; border-bottom: 1px solid currentColor;">
	<th style="text-align: left; padding: 8px;">Dataset</th>
	<th>Size</th>
	<th>GSM8K</th>
	<th>Math500</th>
	<th>Omni-Math</th>
	<th>Olympiad</th>
	<th>AIME'24</th>
	<th>AIME'25</th>
	<th>CMIMC'25</th>
	<th>HMMT'25</th>
	<th>BRUMO'25</th>
	<th style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>AVG</b></th>
	</tr>
	</thead>
	<tbody>
	<tr style="border-top: 1px solid currentColor; background-color: rgba(128, 128, 128, 0.08); font-weight: bold;">
	<td colspan="12" style="text-align: center; padding: 10px 8px; letter-spacing: 1px;">Qwen2.5-7B-Base</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;">Qwen2.5-7B-Base</td>
	<td>-</td><td>80.0</td><td>50.2</td><td>26.0</td><td>35.9</td><td>6.7</td><td>6.7</td><td>10.0</td><td>0.0</td><td>20.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.2</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/GAIR/LIMO">LIMO</a></td>
	<td>817</td><td>92.1</td><td>66.8</td><td>21.6</td><td>34.9</td><td>4.6</td><td>1.7</td><td>0.0</td><td>0.0</td><td>5.4</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">25.2</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/nvidia/OpenMathInstruct-2">OpenMathInstruct-2</a></td>
	<td>1M</td><td>91.6</td><td>65.9</td><td>22.5</td><td>30.7</td><td>6.7</td><td>5.0</td><td>5.0</td><td>0.0</td><td>13.6</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">26.8</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/MegaScience/MegaScience">MegaScience (math)</a></td>
	<td>414k</td><td>90.1</td><td>77.8</td><td>28.7</td><td>44.5</td><td>16.7</td><td>15.0</td><td>8.1</td><td>0.0</td><td>26.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">34.2</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT">Fast-Math-R1-SFT</a></td>
	<td>8k</td><td>90.6</td><td>80.0</td><td>35.8</td><td>50.3</td><td>23.3</td><td>26.7</td><td>7.5</td><td>8.3</td><td>31.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">39.4</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/zwhe99/DeepMath-103K">DeepMath-103K</a></td>
	<td>103k</td><td>92.1</td><td>92.0</td><td>45.4</td><td>60.2</td><td>34.2</td><td>31.7</td><td>10.0</td><td>11.7</td><td>15.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">43.6</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/qihoo360/Light-R1-SFTData">Light-R1-SFT</a></td>
	<td>79k</td><td>92.0</td><td>88.0</td><td>43.3</td><td>60.2</td><td>38.3</td><td>26.7</td><td>22.5</td><td>13.3</td><td>38.3</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">47.0</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/PrimeIntellect/SYNTHETIC-2-SFT-verified">SYNTHETIC-2 (math)</a></td>
	<td>50k</td><td>92.1</td><td>90.0</td><td>54.5</td><td>67.4</td><td>45.0</td><td>35.0</td><td>19.7</td><td>20.0</td><td>36.7</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">51.2</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/miromind-ai/MiroMind-M1-SFT-719K">MiroMind-M1-SFT</a></td>
	<td>719k</td><td><u>93.9</u></td><td>91.6</td><td>48.1</td><td>66.3</td><td>55.0</td><td>30.0</td><td>27.5</td><td>18.3</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">53.4</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/alibaba-pai/OmniThought-0528">OmniThought-0528</a></td>
	<td>365k</td><td>93.2</td><td>89.8</td><td>54.3</td><td>68.1</td><td>50.4</td><td>40.0</td><td>25.0</td><td>28.3</td><td>45.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">54.9</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M">OpenThoughts3</a></td>
	<td>1.2M</td><td>91.7</td><td>93.8</td><td>44.8</td><td>68.8</td><td><u>60.0</u></td><td>45.0</td><td>27.5</td><td>31.7</td><td>50.0</td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);">57.0</td>
	</tr>
	<tr>
	<td style="text-align: left; padding: 8px;"><a href="https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled">AM-Thinking (math)</a></td>
	<td>558k</td><td>92.9</td><td><b>96.2</b></td><td><u>60.6</u></td><td><b>74.2</b></td><td><b>63.3</b></td><td><u>50.0</u></td><td><u>27.8</u></td><td><u>36.7</u></td><td><b>63.3</b></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><u>62.8</u></td>
	</tr>
	<tr style="background-color: rgba(128, 128, 128, 0.18); font-weight: bold; border-bottom: 2px solid currentColor;">
	<td style="text-align: left; padding: 8px;">ODA-Math</td>
	<td>460k</td><td><b>94.3</b></td><td><u>95.4</u></td><td><b>62.6</b></td><td><u>70.9</u></td><td>56.7</td><td><b>56.7</b></td><td><b>35.0</b></td><td><b>45.0</b></td><td><u>60.0</u></td><td style="border-left: 1px solid rgba(128, 128, 128, 0.3);"><b>64.1</b></td>
	</tr>
	</tbody>
	</table>
	</div>
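
The AVG column is consistent with an unweighted mean over the nine benchmark columns, although the card does not state how it is computed. The sketch below checks that reading against three rows copied from the table.

```python
# Scores copied from the table above, in column order:
# GSM8K, Math500, Omni-Math, Olympiad, AIME'24, AIME'25, CMIMC'25, HMMT'25, BRUMO'25.
# Each entry pairs the nine scores with the AVG value reported in the table.
ROWS = {
    "Qwen2.5-7B-Base": ([80.0, 50.2, 26.0, 35.9, 6.7, 6.7, 10.0, 0.0, 20.0], 26.2),
    "AM-Thinking (math)": ([92.9, 96.2, 60.6, 74.2, 63.3, 50.0, 27.8, 36.7, 63.3], 62.8),
    "ODA-Math": ([94.3, 95.4, 62.6, 70.9, 56.7, 56.7, 35.0, 45.0, 60.0], 64.1),
}


def mean_score(scores):
    """Unweighted mean of the benchmark scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for name, (scores, reported) in ROWS.items():
    print(f"{name:<20} computed {mean_score(scores):.1f}   reported {reported:.1f}")
```

Because the mean is unweighted, the small competition sets count as much as GSM8K or Math500, so a few problems on those sets can move AVG noticeably.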

	---

	## 🌐 About OpenDataArena

	[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to discovering, evaluating, and advancing high-quality datasets for AI post-training. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.

	Key Features:
	- 🏆 Dataset Leaderboard: helps researchers identify the most valuable and high-quality datasets across different domains.
	- 📊 Detailed Evaluation Scores: provides comprehensive metrics to assess data quality, complexity, difficulty, etc.
	- 🧰 Data Processing Toolkit: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.

	If you find our work helpful, please consider ⭐ starring and subscribing to support our research.

	---

	## 🚀 Usage

	Model repo: `OpenDataArena/Qwen2.5-7B-ODA-Math-460k`. Below is a minimal runnable example for loading and inference:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	MODEL_ID = "OpenDataArena/Qwen2.5-7B-ODA-Math-460k"

	# Load the tokenizer and model. device_map="auto" requires the accelerate package
	# and places the weights on the available devices.
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

	# Format the conversation with the model's chat template.
	messages = [
	    {"role": "user", "content": "Solve: If f(x)=x^2+1, what is f(3)?"},
	]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# Sample up to 512 new tokens.
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=512,
	    do_sample=True,
	    temperature=0.7,
	    top_p=0.9,
	)
	# outputs[0] holds the prompt tokens followed by the generated tokens.
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	---

	## 📚 Citation

	```bibtex
	@article{gao2025closing,
	title={Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets},
	author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Cai, Mengzhang and He, Conghui and Wu, Lijun},
	journal={arXiv preprint arXiv:2601.09733},
	year={2025}
	}
	```
	```bibtex
	@article{cai2025opendataarena,
	title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
	author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
	journal={arXiv preprint arXiv:2512.14051},
	year={2025}
	}
	```