OpenThoughts: Data Recipes for Reasoning Models

Etash Kumar Guha; Ryan Marten; Sedrick Keh; Negin Raoof; Georgios Smyrnis; Hritik Bansal; Marianna Nezhurina; Jean Mercat; Trung Vu; Zayne Rea Sprague; Ashima Suvarna; Benjamin Feuer; Leon Liangyu Chen; Zaid Khan; Eric Frankel; Sachin Grover; Caroline Choi; Niklas Muennighoff; Shiye Su; Wanjia Zhao; John Yang; Shreyas Pimpalgaonkar; Kartik Sharma; Charlie Cheng-Jie Ji; Yichuan Deng; Sarah M Pratt; Vivek Ramanujan; Jon Saad-Falcon; Stutee Acharya; Jeffrey Li; Achal Dave; Alon Albalak; Kushal Arora; Blake Wulfe; Chinmay Hegde; Greg Durrett; Sewoong Oh; Mohit Bansal; Saadia Gabriel; Aditya Grover; Kai-Wei Chang; Vaishaal Shankar; Aaron Gokaslan; Mike A Merrill; Tatsunori Hashimoto; Yejin Choi; Jenia Jitsev; Reinhard Heckel; Maheswaran Sathiamoorthy; Alex Dimakis; Ludwig Schmidt

Published: 26 Jan 2026, Last Modified: 11 Apr 2026 | ICLR 2026 Oral | CC BY 4.0

Keywords: Reasoning, Data, LLM

TL;DR: Data pipeline analysis for training reasoning models

Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. Our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond – improvements of 15.3, 17.2, and 20.5 percentage points compared to DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available on openthoughts.ai.

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 5295
