Keywords: Reasoning, Data, LLM
TL;DR: Data pipeline analysis for training reasoning models
Abstract: Reasoning models have made rapid progress on many benchmarks involving math,
code, and science. Yet, there are still many open questions about the best training
recipes for reasoning since state-of-the-art models often rely on proprietary
datasets with little to no public information available. To address this, the goal of
the OpenThoughts project is to create open-source datasets for training reasoning
models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model
trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard
reasoning benchmarks such as AIME and LiveCodeBench. We then improve
our dataset further by systematically investigating each step of our data generation
pipeline with 1,000+ controlled experiments, which led to OpenThoughts3.
Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields
our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on
AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond
– improvements of 15.3, 17.2, and 20.5 percentage points, respectively, over
DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on
openthoughts.ai.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5295