Keywords: Reasoning model; Large Language Model; Overthinking
TL;DR: Helping models learn appropriate reasoning lengths to improve performance
Abstract: Recent advances in chain-of-thought (CoT) reasoning and post-training have improved LLMs’ reasoning abilities, but often at the cost of generating redundant steps, leading to wasted computation and increased latency in real-time applications. Existing reinforcement learning (RL) approaches attempt to condense CoT by rewarding brevity, but they fall short in two key aspects: (1) For highly difficult queries, they waste tokens on hopeless reasoning attempts; (2) For medium-difficulty queries, models either stop too soon and miss the answer, or continue past the correct answer and introduce errors. To address these issues, we propose RazorReward, a novel reward scheme that sharply differentiates optimal from suboptimal reasoning. For hard queries, RazorReward penalizes unnecessary CoT steps and encourages abstention when no solution is possible. For medium-difficulty queries, it rewards only reasoning paths that match the minimal sufficient CoT steps, heavily penalizing both under- and over-reasoning. Building on this, we introduce RazorReward-RL, an RL framework that segments CoT into semantically meaningful blocks, enabling more precise early stopping and targeted reward allocation. Extensive experiments on six reasoning benchmarks show that RazorReward-RL consistently outperforms previous methods, boosting accuracy by 8.3%–9.3% while reducing average token usage by 38.4%–43.8%, thus achieving a better balance between accuracy and efficiency.
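As a rough illustration of the reward shaping described in the abstract, the Python sketch below assigns a scalar reward per sampled CoT under the two regimes (hard queries with no solution, medium-difficulty queries with a minimal sufficient length). All function names, thresholds, and penalty coefficients here are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch of a length-aware reward in the spirit of the abstract.
# Names, thresholds, and penalty values are illustrative assumptions only.

def razor_style_reward(
    is_correct: bool,
    abstained: bool,
    num_steps: int,
    minimal_steps: int | None,
    difficulty: str,  # assumed coarse label: "easy", "medium", or "hard"
) -> float:
    """Assign a scalar reward to one sampled chain of thought.

    minimal_steps: assumed estimate of the minimal sufficient number of CoT
        blocks for a solvable query; None if the query is judged unsolvable.
    """
    if difficulty == "hard" and minimal_steps is None:
        # Unsolvable query: reward abstention and penalize tokens spent
        # on hopeless reasoning attempts.
        return 1.0 - 0.05 * num_steps if abstained else -0.05 * num_steps

    if not is_correct:
        # Wrong final answer: flat penalty regardless of length.
        return -1.0

    if difficulty == "medium" and minimal_steps is not None:
        # Correct answer: full reward only near the minimal sufficient
        # length; both under- and over-reasoning are penalized sharply.
        gap = abs(num_steps - minimal_steps)
        return max(1.0 - 0.5 * gap, -1.0)

    # Remaining (e.g. easy) queries: correct answer with a mild length penalty.
    return 1.0 - 0.01 * num_steps
```

In this sketch, "steps" stand in for the semantically segmented CoT blocks mentioned in the abstract; the actual segmentation and reward allocation of RazorReward-RL are described in the paper itself.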
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8660