Self-Guided Thinking: Enabling LLMs to Decide When to Think

20 Sept 2025 (modified: 20 Jan 2026), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0
Keywords: Large language models, Reinforcement learning, reasoning, Large reasoning models, Test-time compute
TL;DR: A framework to train a single LLM to decide when to think across varied domains.
Abstract: Large reasoning models improve performance on complex tasks by generating extended thought processes, but applying this approach uniformly to general user queries is computationally wasteful. Current solutions require complex multi-model systems or burden the user with manual controls. To address this, we introduce Self-Guided Thinking (SGT), a framework that enables a single model to learn to decide for itself when to think. SGT seamlessly integrates a lightweight penalty for deliberation into the Direct Preference Optimization (DPO) objective during the general alignment phase, teaching the model to balance performance with computational cost. Our experiments show that SGT learns a sophisticated, domain-adaptive policy. It achieves near-peak performance on general benchmarks while significantly reducing unnecessary thinking, and generalizes effectively to challenging out-of-distribution tasks by increasing its thinking where needed. On verifiable benchmarks, we find that while SGT preserves the model's reasoning capabilities, the general alignment stage does not substantially improve them over a fine-tuned baseline, suggesting the need for targeted in-domain training for further gains. Our ablations reveal that SGT teaches the model when to deploy a pre-existing capability, not how to reason from scratch; the policy’s effectiveness is contingent on foundational knowledge from prior SFT and sufficient response length. Together, these findings demonstrate that an autonomous reasoning policy can be learned efficiently during general alignment, offering a practical path to deploy more economical and versatile models.
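The abstract does not spell out how the deliberation penalty enters the DPO objective, so the following is only a minimal sketch of one plausible reading, not the authors' formulation. It assumes each candidate response carries a scalar quality score and a flag saying whether it used a thinking phase; a fixed penalty is subtracted from thinking responses before the preference pair is formed, and the standard DPO loss is then applied to that pair. The names `Candidate`, `make_pair`, `think_penalty`, and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    quality: float  # hypothetical scalar score, e.g. from a reward model
    thinks: bool    # True if the response includes a thinking phase


def adjusted_score(c: Candidate, think_penalty: float) -> float:
    # Assumption: deliberation costs a fixed amount of score.
    return c.quality - (think_penalty if c.thinks else 0.0)


def make_pair(a: Candidate, b: Candidate, think_penalty: float = 0.05):
    # Return (chosen, rejected) ranked by cost-adjusted score.
    if adjusted_score(a, think_penalty) >= adjusted_score(b, think_penalty):
        return a, b
    return b, a


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    # Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin).
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable softplus(-margin), which equals -log sigmoid(margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Easy query: thinking barely helps, so the direct answer is preferred.
easy = make_pair(Candidate("think", 0.80, True), Candidate("direct", 0.78, False))
# Hard query: thinking helps a lot, so the thinking answer is preferred.
hard = make_pair(Candidate("think", 0.90, True), Candidate("direct", 0.60, False))
```

Under this reading, a thinking response wins the preference only when its quality gain exceeds the penalty, which is one way a model could be pushed to think on hard queries and answer directly on easy ones.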
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23469