Keywords: Reasoning model; Large Language Model; Overthinking
TL;DR: Helping models learn appropriate reasoning lengths to improve performance
Abstract: Recent advances in chain-of-thought (CoT) reasoning and post-training have improved LLMs’ reasoning abilities, but often at the cost of generating redundant steps, leading to wasted computation and increased latency in real-time applications. Existing reinforcement learning (RL) approaches attempt to condense CoT by rewarding brevity, but they fall short in two key aspects: (1) For highly difficult queries, they waste tokens on hopeless reasoning attempts; (2) For medium-difficulty queries, models either stop too soon and miss the answer, or continue past the correct answer and introduce errors. To address these issues, we propose RazorReward, a novel reward scheme that sharply differentiates optimal from suboptimal reasoning. For hard queries, RazorReward penalizes unnecessary CoT steps and encourages abstention when no solution is possible. For medium-difficulty queries, it rewards only reasoning paths that match the minimal sufficient CoT steps, heavily penalizing both under- and over-reasoning. Building on this, we introduce RazorReward-RL, an RL framework that segments CoT into semantically meaningful blocks, enabling more precise early stopping and targeted reward allocation. Extensive experiments on six reasoning benchmarks show that RazorReward-RL consistently outperforms previous methods, boosting accuracy by 8.3%–9.3% while reducing average token usage by 38.4%–43.8%, thus achieving a better balance between accuracy and efficiency.
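As a rough illustration of the reward shaping described in the abstract, the Python sketch below assigns a scalar reward per sampled CoT under the two regimes (hard queries with no solution, medium-difficulty queries with a minimal sufficient length). All function names, thresholds, and penalty coefficients here are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch of a length-aware reward in the spirit of the abstract.
# Names, thresholds, and penalty values are illustrative assumptions only.

def razor_style_reward(
    is_correct: bool,
    abstained: bool,
    num_steps: int,
    minimal_steps: int | None,
    difficulty: str,  # assumed coarse label: "easy", "medium", or "hard"
) -> float:
    """Assign a scalar reward to one sampled chain of thought.

    minimal_steps: assumed estimate of the minimal sufficient number of CoT
        blocks for a solvable query; None if the query is judged unsolvable.
    """
    if difficulty == "hard" and minimal_steps is None:
        # Unsolvable query: reward abstention and penalize tokens spent
        # on hopeless reasoning attempts.
        return 1.0 - 0.05 * num_steps if abstained else -0.05 * num_steps

    if not is_correct:
        # Wrong final answer: flat penalty regardless of length.
        return -1.0

    if difficulty == "medium" and minimal_steps is not None:
        # Correct answer: full reward only near the minimal sufficient
        # length; both under- and over-reasoning are penalized sharply.
        gap = abs(num_steps - minimal_steps)
        return max(1.0 - 0.5 * gap, -1.0)

    # Remaining (e.g. easy) queries: correct answer with a mild length penalty.
    return 1.0 - 0.01 * num_steps
```

In this sketch, "steps" stand in for the semantically segmented CoT blocks mentioned in the abstract; the actual segmentation and reward allocation of RazorReward-RL are described in the paper itself.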
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8660