SteadyThought: Mitigating LLM Under-Thinking via Thought-Level Preference Optimization

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM Reasoning, Underthinking
Abstract: Flexible switching between reasoning trajectories (i.e., thought switching) has significantly enhanced the reasoning capabilities of Large Reasoning Models (LRMs). However, existing models often switch excessively yet fail to sustain promising reasoning thoughts, a phenomenon termed "under-thinking". While recent efforts suppress switching to mitigate this, such over-correction may discard valuable trajectories. To address this challenge, we propose Steady Thought (ST), a novel thought-level preference optimization framework. ST first segments model responses into thought sequences, then guides the model to complete reasoning from these thoughts without further switching, generating coherent trajectories. Finally, ST performs thought-level preference optimization by treating the newly generated response as preferred and the original one as dis-preferred. Experiments across multiple models and datasets show that ST effectively mitigates under-thinking. It reduces output length by up to 39.3% while improving accuracy by up to 5.3%, with strong generalization. Further analysis confirms that ST leads to more rational switching and deeper exploration of solution thoughts.
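The pipeline the abstract describes (segment a response into thoughts, complete reasoning without further switching, then pair the two responses for preference optimization) can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: the switch-marker cues, function names, and the preference-pair schema are all assumptions.

```python
# Hypothetical sketch of the Steady Thought (ST) data pipeline.
# The switch-cue regex and all helper names are illustrative assumptions.
import re

# Assumed lexical cues that mark a thought switch in a model response.
SWITCH_MARKERS = r"\b(Alternatively|Wait|On second thought)\b"

def segment_thoughts(response: str) -> list[str]:
    """Split a response into a thought sequence at assumed switching cues."""
    parts = re.split(SWITCH_MARKERS, response)
    # re.split keeps captured markers; reattach each marker to its segment.
    thoughts = [parts[0].strip()]
    for marker, body in zip(parts[1::2], parts[2::2]):
        thoughts.append(f"{marker}{body}".strip())
    return [t for t in thoughts if t]

def build_preference_pair(original: str, steady_completion: str) -> dict:
    """Treat the no-switch completion as preferred, the original as dis-preferred."""
    return {"chosen": steady_completion, "rejected": original}

response = ("Try factoring the quadratic. Alternatively, use the quadratic "
            "formula. Wait, maybe graph it instead.")
thoughts = segment_thoughts(response)
print(len(thoughts))  # 3 thought segments, one per switch
```

In the actual framework, each early thought would be handed back to the model with a prompt that discourages further switching, and the resulting coherent trajectory would form the preferred side of the thought-level preference pair used for optimization (e.g., DPO-style training).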
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23998