Keywords: Reasoning, LLM, Data
TL;DR: A systematic analysis of recipes for generating post-training reasoning data for LLMs.
Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improved our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as the teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available at REDACTED.
Submission Number: 134