Keywords: Reasoning, LLM, Data
TL;DR: A systematic analysis of recipes for generating post-training reasoning data for LLMs.
Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improved our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as the teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available at REDACTED.
Submission Number: 134