Base Models Know How to Reason, Thinking Models Learn When

18 Sept 2025 (modified: 11 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Chain of Thought, Reasoning models, Sparse Autoencoders, Steering
TL;DR: We show that rule-based RL training mainly teaches thinking models when to fire skills that already exist in base models
Abstract: Why do "thinking" language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, the underlying mechanism, i.e., what differs between base and thinking models, remains unclear. In this work, we show that, by deriving mechanism-specific steering vectors and activating the right vector at the right time in base models, we can elicit thinking-model-level reasoning chains. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning mechanisms in thinking models. This approach provides a principled way to discover reasoning mechanisms without imposing manual or LLM-derived assumptions about what reasoning mechanisms should exist, reducing bias. Across three base and four thinking models, on GSM8K and MATH500, our combined approach substantially lifts base-model accuracy, recovering up to 91% of the gap to thinking models without any weight updates. This suggests that reinforcement learning with verifiable rewards mainly teaches thinking models when to fire pre-trained skills, rather than how to execute them, enabling the model to productively use its inference-time compute. Our work reframes our understanding of the performance gains of thinking models and may shed light on why distillation and reinforcement learning are so effective for teaching reasoning to small LLMs.
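To make the central idea concrete, below is a minimal sketch of what "activating the right vector at the right time" in a base model could look like: a steering vector is added to a chosen layer's residual stream via a forward hook, but only at certain decoding steps. This is illustrative only, not the authors' implementation; the model name, layer index, steering strength, and the step-based trigger are all assumptions, and a real mechanism vector would come from the paper's SAE-based discovery procedure rather than random initialization.

```python
# Illustrative sketch (not the paper's method): steer a base model by adding a
# vector to one decoder layer's output, toggled on at selected decoding steps.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B"   # placeholder base model (assumption)
LAYER_IDX = 14                      # layer to steer (assumption)
ALPHA = 4.0                         # steering strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Stand-in for a learned, mechanism-specific steering vector.
steer_vec = torch.randn(model.config.hidden_size)
steer_vec = steer_vec / steer_vec.norm()

steer_on = {"active": False}  # flipped by the decoding loop below

def hook(module, inputs, output):
    # Decoder layers usually return a tuple whose first element is the
    # hidden states; add the steering vector only when the trigger is on.
    if not steer_on["active"]:
        return output
    if isinstance(output, tuple):
        hs = output[0] + ALPHA * steer_vec.to(output[0].dtype).to(output[0].device)
        return (hs,) + output[1:]
    return output + ALPHA * steer_vec.to(output.dtype).to(output.device)

handle = model.model.layers[LAYER_IDX].register_forward_hook(hook)

prompt = "If each of 48 students buys 2 pencils, how many pencils are sold? Let's think step by step."
ids = tok(prompt, return_tensors="pt").input_ids

# Greedy decoding with a toy "when to fire" rule: steer only after some tokens
# have been generated, mimicking switching a mechanism on mid-chain.
with torch.no_grad():
    for step in range(64):
        steer_on["active"] = step >= 16  # illustrative trigger, not the paper's
        logits = model(ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break

handle.remove()
print(tok.decode(ids[0], skip_special_tokens=True))
```

In this toy setup the trigger is a fixed step count; the paper's claim is that the interesting part is precisely learning a good trigger, i.e., when each pre-existing mechanism should fire.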
Primary Area: interpretability and explainable AI
Submission Number: 12244