Qwen3-8B-R-Select-100k

Qwen3-8B-R-Select-100k is a supervised fine-tuned (SFT) model built on top of Qwen3-8B-Base, trained with R-Select-100k.

🧠 Model Summary

  • Base Model: Qwen/Qwen3-8B-Base
  • Training Data: OpenDataArena/R-Select-100k
  • Domain Coverage: General, Math, Code, Reasoning
  • Scale (selected training set): 100K samples

📄 About R-Select

"R-Select: A Robust Multi-Metric Data Selection Approach for Fine-Tuning Large Language Models" is a KDD 2026 paper that studies how to select high-quality SFT data from large heterogeneous instruction-tuning pools. R-Select is designed to move beyond single-metric filtering and manually designed aggregation rules by formulating data selection as a multi-metric weight optimization problem.

Specifically, R-Select annotates each sample with 30 quality metrics, clusters correlated metrics into functional groups, and learns a hierarchical selection policy through proxy-model validation. The learned policy is then applied to a source pool of over 3.4M samples to select 100K high-value samples for downstream SFT.

Code: https://github.com/OpenDataArena/R-Select


⚙️ Data Curation Pipeline

Figure: Overview of R-Select.

R-Select-100k is built by applying the R-Select data selection framework to a large-scale heterogeneous SFT data pool.

1️⃣ Source Pool Construction

We construct a candidate pool from 21 publicly available instruction-tuning datasets, covering diverse domains and paradigms such as reasoning, code, math, and general instruction following. The source pool includes datasets such as OpenThoughts3, AM-Thinking-v1-Distilled, OpenThoughts, OmniThought, and LIMO.

2️⃣ Multi-Metric Annotation

Each sample is annotated with 30 complementary data-quality metrics using the OpenDataArena toolkit. These metrics cover three broad categories:

  • Model-based metrics: e.g., IFD, PPL, SkyworkRM.
  • Heuristic metrics: e.g., token length, token entropy, MTLD.
  • LLM-as-Judge: e.g., complexity.

Before optimization, metric values are standardized through outlier mitigation, Min-Max normalization, and score inversion for metrics where lower values indicate better quality. The scored version of the data has also been released at: OpenDataArena/OpenDataArena-scored-data-2603. Detailed metric information can also be found there.
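The standardization steps described above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the percentile-based clipping rule, the function names, and the sample perplexity values are all assumptions made for the example, and treating PPL as a lower-is-better metric is likewise assumed here.

```python
from statistics import quantiles


def clip_outliers(values, lower_pct=5, upper_pct=95):
    """Clamp values to a percentile range (an assumed outlier rule)."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


def min_max(values):
    """Rescale values linearly to [0, 1]; constant columns map to 0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standardize_metric(values, lower_is_better=False):
    """Clip, normalize, and optionally invert one metric column."""
    scaled = min_max(clip_outliers(values))
    return [1.0 - s for s in scaled] if lower_is_better else scaled


# Made-up perplexity values with one extreme outlier.
ppl = [3.2, 4.1, 5.0, 6.3, 250.0]
ppl_scores = standardize_metric(ppl, lower_is_better=True)
print([round(s, 3) for s in ppl_scores])
```

After this step every metric lies in [0, 1] with higher meaning better, so the metrics can be combined by a weighted sum without one scale dominating.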

3️⃣ Hierarchical Bayesian Optimization

R-Select formulates data selection as a learnable multi-metric weighting problem. Instead of using a single metric or manually designed aggregation rule, it learns how to combine metrics automatically.

The optimization process has two stages:

  • Intra-group refinement: correlated metrics are clustered with Ward hierarchical clustering, and local weights are optimized within each metric group.
  • Inter-group integration: group-level signals are combined through a second-stage optimization to balance global quality dimensions.
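One way to picture the two-stage structure is as a pair of weight tables that collapse into a single weight per metric. The sketch below only shows that composition; the group names, the grouping itself, the weight values, and the choice to normalize weights within each group are illustrative assumptions, and the actual search over these weights is not shown.

```python
def flatten_weights(groups, intra_weights, inter_weights):
    """Combine intra-group and inter-group weights into one weight per metric.

    Each metric's weight is its share of its group's local weights,
    scaled by the weight assigned to that group.
    """
    flat = {}
    for group_name, metrics in groups.items():
        total = sum(intra_weights[m] for m in metrics)
        for m in metrics:
            flat[m] = inter_weights[group_name] * intra_weights[m] / total
    return flat


# Hypothetical grouping of four metrics into two correlated clusters.
groups = {"group_a": ["IFD", "PPL"], "group_b": ["token_length", "MTLD"]}
intra_weights = {"IFD": 3.0, "PPL": 1.0, "token_length": 1.0, "MTLD": 1.0}
inter_weights = {"group_a": 0.6, "group_b": 0.4}

metric_weights = flatten_weights(groups, intra_weights, inter_weights)
for name, weight in metric_weights.items():
    print(f"{name}: {weight:.2f}")
```

Splitting the search this way keeps each optimization low-dimensional: the first stage tunes only the metrics inside one group, and the second stage tunes one weight per group.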

The search uses a lightweight proxy model, Qwen3-1.7B-Base, and the Optuna TPE optimizer. The proxy validation set contains 2,255 samples from Omni-Math, GPQA, MMLU, BigCodeBench, and MBPP.

4️⃣ Final Selection

After optimization, the learned weight vector is applied to the full candidate pool. Each sample receives a final scalar quality score, and the top 100K samples are selected to form R-Select-100k.
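The final step amounts to a weighted sum followed by a top-k cut. A minimal sketch, assuming the metrics are already normalized to [0, 1]; the weights, sample values, and record layout below are invented for illustration and are not the learned policy.

```python
import heapq


def quality_score(metrics, weights):
    """Scalar quality score: weighted sum of normalized metric values."""
    return sum(weights[name] * metrics[name] for name in weights)


def select_top_k(pool, weights, k):
    """Return the k samples with the highest quality score, best first."""
    return heapq.nlargest(k, pool, key=lambda s: quality_score(s["metrics"], weights))


# Hypothetical weight vector and a four-sample candidate pool.
weights = {"IFD": 0.5, "PPL": 0.3, "MTLD": 0.2}
pool = [
    {"id": "a", "metrics": {"IFD": 0.9, "PPL": 0.8, "MTLD": 0.1}},
    {"id": "b", "metrics": {"IFD": 0.2, "PPL": 0.3, "MTLD": 0.9}},
    {"id": "c", "metrics": {"IFD": 0.7, "PPL": 0.9, "MTLD": 0.6}},
    {"id": "d", "metrics": {"IFD": 0.1, "PPL": 0.1, "MTLD": 0.2}},
]

selected = select_top_k(pool, weights, k=2)
selected_ids = [s["id"] for s in selected]
print(selected_ids)
```

`heapq.nlargest` avoids sorting the whole pool, which matters when k is small relative to a pool of millions of samples.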


📚 Source Composition

Source Selected_count Percentage
AM-Thinking-v1-Distilled-math 28,434 28.43%
OpenThoughts 22,904 22.90%
AM-Thinking-v1-Distilled-code 12,670 12.67%
Tulu-3-Persona-MATH 8,693 8.69%
OpenO1-SFT 6,670 6.67%
OpenThoughts3 4,062 4.06%
NuminaMath-TIR 3,825 3.83%
Raiden-DeepSeek-R1 3,104 3.10%
OmniThought 1,735 1.74%
Magpie-Pro-GPT4o-mini 1,576 1.58%
Fast-Math-R1-SFT 1,493 1.49%
Tulu-3-Persona-Python 1,469 1.47%
Tulu-3-Persona-IF 1,403 1.40%
SYNTHETIC-2-SFT-verified 1,014 1.01%
Tulu-3-Persona-Algebra 428 0.43%
Tulu-3-Persona-GSM 190 0.19%
FLAN-v2 105 0.11%
LIMO 99 0.10%
Evol-CodeAlpaca 80 0.08%
SciRIFF 46 0.05%
No-Robots 0 0.00%

📚 Data Format

{
  "id": "unique_identifier",
  "source": "source_dataset",
  "instruction": "textual question or instruction",
  "output": "textual response",
  "scores": {
    "AtheneScore": ...,
    "CleanlinessScore": ...,
    ...
  },
  "overall_score": ...
}
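Records in this shape are straightforward to filter and summarize. The sketch below assumes the data is stored as JSON Lines and that `scores` is a mapping from metric name to value; the sample records, score values, and the 0.5 threshold are made up, and fields not needed for the example are omitted.

```python
import json
from collections import Counter

# Hypothetical JSON Lines content; real records carry more fields and metrics.
raw_lines = [
    '{"id": "ex-1", "source": "LIMO", "instruction": "Q1", "scores": {"AtheneScore": 0.8}, "overall_score": 0.91}',
    '{"id": "ex-2", "source": "OpenThoughts", "instruction": "Q2", "scores": {"AtheneScore": 0.4}, "overall_score": 0.35}',
    '{"id": "ex-3", "source": "OpenThoughts", "instruction": "Q3", "scores": {"AtheneScore": 0.7}, "overall_score": 0.77}',
]

records = [json.loads(line) for line in raw_lines]

# Keep samples above an arbitrary overall-score threshold.
kept = [r for r in records if r["overall_score"] >= 0.5]

# Count the surviving samples per source dataset.
by_source = Counter(r["source"] for r in kept)
print(by_source.most_common())
```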

📈 Performance

R-Select-100k is evaluated as an SFT corpus for both Qwen2.5-7B-Base and Qwen3-8B-Base. Evaluation is conducted with OpenCompass on unseen benchmarks across four domains:

  • General: DROP, IFEval, MMLU-Pro
  • Math: MATH500, OlympiadBench, AIME2024
  • Code: HumanEval, HumanEval+, LiveCodeBench v5
  • Reasoning: ARC-C, BBH, KOR-Bench

R-Select-100k achieves the best reported average performance among the compared open-source SFT datasets on both base models (AVG 58.1 on Qwen2.5-7B-Base and 69.9 on Qwen3-8B-Base) while using only 100K samples.

Comparison between R-Select and representative open-source high-quality SFT datasets. Column abbreviations: MMLU-P = MMLU-Pro, OLYMP = OlympiadBench, HE = HumanEval, HE+ = HumanEval+, LCB = LiveCodeBench v5, K.B. = KOR-Bench, AVG = average.

Base model: Qwen2.5-7B-Base
Dataset            Size  DROP  IFEval  MMLU-P  MATH500  OLYMP  AIME'24  HE    HE+   LCB   ARC-C  BBH   K.B.  AVG
Qwen2.5-7B-Base    -     68.3  35.5    44.2    50.2     35.9   6.7      77.4  43.3  8.2   36.6   69.5  33.3  45.8
LIMO               817   73.5  53.3    54.7    66.8     34.9   4.6      83.5  59.8  17.6  90.9   55.6  48.3  53.6
Light-R1-SFT       79k   79.4  38.5    46.3    88.0     60.2   38.3     45.7  40.2  3.2   78.0   72.7  52.3  53.6
SYNTHETIC-2-SFT    105k  64.4  54.9    35.5    90.0     67.4   45.0     42.1  40.2  11.1  93.2   81.1  58.8  57.0
OpenThoughts       114k  68.5  41.2    55.6    83.6     53.3   22.5     68.9  68.9  17.2  90.5   72.4  51.0  57.8
OmniThought        365k  52.8  34.3    38.1    89.8     68.1   50.4     57.9  51.8  17.9  90.5   76.8  58.9  57.2
MiroMind           719k  82.3  30.6    38.6    91.6     66.3   55.0     32.3  26.2  5.4   84.1   74.9  51.4  53.2
Tulu3-sft-mixture  939k  62.3  72.1    46.0    55.5     29.0   4.4      79.1  66.7  12.3  80.7   61.9  48.5  51.5
R-Select           100k  65.7  46.6    46.9    86.6     64.4   31.7     65.9  63.6  19.0  87.5   66.8  52.2  58.1

Base model: Qwen3-8B-Base
Dataset            Size  DROP  IFEval  MMLU-P  MATH500  OLYMP  AIME'24  HE    HE+   LCB   ARC-C  BBH   K.B.  AVG
Qwen3-8B-Base      -     71.5  45.9    56.2    79.6     47.2   6.7      82.9  34.2  16.9  37.3   78.1  46.6  50.3
LIMO               817   72.0  52.3    49.8    69.0     31.3   12.5     81.1  62.2  13.6  83.4   48.7  45.7  51.8
Light-R1-SFT       79k   83.4  46.8    57.7    92.6     69.7   54.6     81.7  65.9  14.7  92.2   84.5  60.4  67.0
SYNTHETIC-2-SFT    105k  38.7  67.6    56.8    93.8     71.5   58.8     81.1  42.7  18.3  92.9   86.6  64.1  64.4
OpenThoughts       114k  79.6  43.6    38.7    92.2     71.5   47.9     72.0  75.0  31.5  91.2   82.3  57.7  65.2
OmniThought        365k  50.3  49.7    46.2    95.4     74.9   67.9     91.5  64.6  29.4  93.9   86.8  65.0  67.9
MiroMind           719k  85.0  43.5    55.1    96.8     77.0   62.9     82.3  71.3  21.2  92.9   86.5  62.0  69.7
Tulu3-sft-mixture  939k  56.3  69.6    44.7    56.4     34.4   4.2      72.0  62.8  15.1  85.4   64.7  47.0  51.0
R-Select           100k  81.8  56.4    61.1    94.4     71.1   56.7     81.1  75.6  28.0  92.9   80.6  59.1  69.9
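As a quick reading aid for the AVG column, the sketch below recomputes it for the R-Select row under Qwen3-8B-Base. It assumes AVG is the unweighted mean of the twelve benchmark scores, which the card does not state explicitly; the dictionary layout is only for illustration.

```python
# Benchmark scores for R-Select (100k) on Qwen3-8B-Base, taken from the table.
r_select_qwen3 = {
    "DROP": 81.8, "IFEval": 56.4, "MMLU-Pro": 61.1,
    "MATH500": 94.4, "OlympiadBench": 71.1, "AIME2024": 56.7,
    "HumanEval": 81.1, "HumanEval+": 75.6, "LiveCodeBench v5": 28.0,
    "ARC-C": 92.9, "BBH": 80.6, "KOR-Bench": 59.1,
}

# Assumed definition: unweighted mean over all twelve benchmarks.
average = sum(r_select_qwen3.values()) / len(r_select_qwen3)
print(round(average, 1))
```

Under that assumption the result agrees with the 69.9 reported for this row.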

🚀 Usage

Model repo: OpenDataArena/Qwen3-8B-R-Select-100k. Below is a minimal runnable example for loading and inference:

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "OpenDataArena/Qwen3-8B-R-Select-100k"

# Load the tokenizer and model. device_map="auto" requires the accelerate package
# and places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"},
]

# Render the conversation with the model's chat template and append the generation prompt.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response of up to 256 new tokens.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

🌐 About OpenDataArena

OpenDataArena is an open research platform for discovering, evaluating, and advancing high-quality datasets for AI post-training. R-Select uses the OpenDataArena toolkit for multi-metric data annotation and quality analysis.

Key Features:

  • 🏆 Dataset Leaderboard — helps researchers identify valuable and high-quality datasets across different domains.
  • 📊 Detailed Evaluation Scores — provides rich metric annotations and scored data to assess data quality, complexity, difficulty, and related properties.
  • 🧰 Data Scoring Toolkit — provides an open-source toolkit for scoring datasets with multiple quality metrics.
  • 🧬 Data Lineage — analyzes relationships among datasets by exploring their composition and source overlap.

If you find our work helpful, please consider ⭐ starring and subscribing to support our research.

📚 Citation

@inproceedings{gao2026rselect,
  title={R-Select: A Robust Multi-Metric Data Selection Approach for Fine-Tuning Large Language Models},
  author={Gao, Xin and Wang, Xiaoyang and Zhu, Yun and Liu, Zheng and He, Conghui and Wu, Lijun},
  booktitle={Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2},
  year={2026},
  publisher={ACM},
  doi={10.1145/3770855.3817656}
}
@article{cai2025opendataarena,
  title={OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value},
  author={Cai, Mengzhang and Gao, Xin and Li, Yu and Lin, Honglin and Liu, Zheng and Pan, Zhuoshi and Pei, Qizhi and Shang, Xiaoran and Sun, Mengyuan and Tang, Zinan and others},
  journal={arXiv preprint arXiv:2512.14051},
  year={2025}
}