Beyond Fixed Budgets: Dynamic Reasoning Efficiency Reward for Large Language Model

08 Sept 2025 (modified: 04 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Overthinking; LLM post-training; Efficient CoT
Abstract: The "slow thinking" paradigm has been widely validated to enhance the reasoning capabilities of large language models, but it also introduces reasoning inefficiency: models may overthink simple problems while prematurely shifting their reasoning paths when tackling complex problems. To address this, we propose AdapThink, a simple yet efficient post-training framework designed to adaptively control preferences for the "slow thinking" pattern. Rather than directly imposing length budgets or setting overlong filters, AdapThink leverages group-level length distributions and reflective-word distributions to construct reasoning process rewards, and introduces a two-stage sampling strategy aimed at maximizing group diversity. Experimental results demonstrate that when post-training two DeepSeek-distilled Qwen models under a context length limit of only 2K tokens, AdapThink achieves a 27% improvement in convergence rewards compared to the GRPO baseline. Notably, when evaluated with a 32K token limit, AdapThink also achieves a 12.6% improvement over the base models on several mathematical benchmarks.
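To make the abstract's mechanism concrete, below is a minimal, hedged sketch of how a group-relative process reward built from length statistics and reflective-word counts might look. The word list, weights, and the split between rewarding conciseness on correct answers versus reflection on incorrect ones are illustrative assumptions, not AdapThink's actual formulation, which is specified in the paper itself.

```python
import numpy as np

# Assumed set of reflective markers; the paper's actual lexicon is not given here.
REFLECTIVE_WORDS = ("wait", "however", "alternatively", "let me reconsider")

def reflective_count(text: str) -> int:
    """Count occurrences of reflective markers in a response."""
    lowered = text.lower()
    return sum(lowered.count(w) for w in REFLECTIVE_WORDS)

def group_process_rewards(responses, correct, length_weight=0.5, reflect_weight=0.5):
    """Score each sampled response relative to its own group (hypothetical scheme).

    Correct responses are rewarded for being shorter than the group average;
    incorrect ones are rewarded for showing more reflection, so the preference
    for 'slow thinking' adapts per group rather than following a fixed budget.
    """
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    reflects = np.array([reflective_count(r) for r in responses], dtype=float)

    # Normalise within the group so the reward is distribution-relative.
    len_z = (lengths - lengths.mean()) / (lengths.std() + 1e-6)
    ref_z = (reflects - reflects.mean()) / (reflects.std() + 1e-6)

    rewards = []
    for i, ok in enumerate(correct):
        if ok:
            # Prefer concise correct answers: penalise above-average length.
            rewards.append(length_weight * (-len_z[i]))
        else:
            # Prefer continued reflection when the answer is wrong.
            rewards.append(reflect_weight * ref_z[i])
    return rewards
```

Such a reward could then be combined with a group-relative policy-optimization objective (e.g. GRPO-style advantages), but the exact integration and the two-stage diversity-maximizing sampler are described only in the paper.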
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2896