Abstract: Scaling inference compute has become a key driver of advanced reasoning in large language models (LLMs). A proven approach for scaling inference compute is to generate long chains-of-thought (CoTs), enabling models to engage in structured reasoning strategies such as backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the underlying *mechanics of long CoT reasoning*—examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings: 1) while SFT is not strictly necessary, it significantly simplifies training and improves efficiency; 2) reasoning capabilities tend to emerge with increased training compute but are not guaranteed, making reward shaping essential for stabilizing CoT length growth; and 3) scaling verifiable reward signals is critical for RL, and we find that leveraging noisy, web-extracted solutions with filtering mechanisms shows promising potential, particularly in out-of-distribution (OOD) reasoning tasks such as STEM problem-solving. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs.
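Since the abstract highlights reward shaping as essential for stabilizing CoT length growth, a minimal sketch may help make the idea concrete. The snippet below shows one plausible way to shape a verifiable-answer reward as a function of CoT length, assuming a cosine interpolation between the reward at zero length and at the length cap; the function names (`cosine_interp`, `shaped_reward`) and all constants are illustrative assumptions, not the paper's exact reward design.

```python
# A minimal sketch of length-aware reward shaping for RL with verifiable
# answers. All names and constants (cosine_interp, shaped_reward, the
# reward values) are illustrative assumptions, not the paper's exact design.
import math

def cosine_interp(r_at_0: float, r_at_max: float, frac: float) -> float:
    """Cosine interpolation from r_at_0 (frac=0) to r_at_max (frac=1)."""
    return r_at_max + 0.5 * (r_at_0 - r_at_max) * (1.0 + math.cos(math.pi * frac))

def shaped_reward(is_correct: bool, cot_len: int, max_len: int = 4096,
                  exceeded_max: bool = False, repetition: float = 0.0) -> float:
    """Scalar reward for one sampled chain-of-thought.

    - Correct answers: shorter CoTs score slightly higher, so length does
      not grow without bound once the answer is already right.
    - Incorrect answers: longer CoTs are penalized less, encouraging the
      policy to keep reasoning when it has not yet found the answer.
    - Truncated or repetitive generations are penalized, so "length"
      cannot be gamed by rambling.
    """
    if exceeded_max:
        return -1.0                      # hard penalty for hitting the length cap
    frac = min(cot_len, max_len) / max_len
    if is_correct:
        reward = cosine_interp(r_at_0=1.0, r_at_max=0.5, frac=frac)
    else:
        reward = cosine_interp(r_at_0=-1.0, r_at_max=-0.5, frac=frac)
    return reward - 0.5 * repetition     # repetition score in [0, 1], e.g. n-gram overlap
```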
Lay Summary: Large language models (LLMs) have made great progress in reasoning, especially with recent breakthroughs like OpenAI's o1 and DeepSeek-R1. These models can now solve much harder problems — from advanced math to software engineering — by thinking in longer, more structured ways. They don't just give answers; they reflect on their reasoning, correct mistakes, and explore different solution paths.
In our work, we study how to train LLMs to reason this way. First, we show that teaching models with examples of long, step-by-step reasoning helps them reach higher performance and makes further training more effective. Second, we find that traditional training methods often struggle to extend reasoning in a stable way — so we design new rewards that encourage deeper thinking without the model becoming repetitive. Lastly, we explore using large but noisy datasets from the web to train models, and show that, with the right techniques, this “imperfect” data can still help models tackle unfamiliar, challenging tasks in science and engineering.
Link To Code: https://github.com/eddycmu/demystify-long-cot
Primary Area: Deep Learning->Large Language Models
Keywords: Reinforcement Learning, Reasoning, Math, Chain-of-Thought, CoT, Supervised Fine-tuning, Reward Design
Submission Number: 7623