RIRM: Reflection Inhibition Reward Mechanism for Mitigating Overthinking in Large Reasoning Models

ACL ARR 2026 January Submission9572 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Reflection Inhibition Reward Mechanism, Overthinking, Redundant Post-Answer Reflection, Large Reasoning Models, Chain-of-Thought, Test-Time Scaling, Reinforcement Learning, Trajectory-Level Reward, Token Efficiency, Efficient Reasoning
Abstract: Large Reasoning Models (LRMs) achieve significant performance gains across various domains by extending the Chain-of-Thought (CoT) length during inference. However, they frequently exhibit a tendency toward overthinking—generating unnecessary reasoning steps after already reaching a correct answer—which undermines both efficiency and performance. To address this issue, we propose the Reflection Inhibition Reward Mechanism (RIRM), a reinforcement learning-based method designed to suppress excessive post-answer reflection. Specifically, RIRM identifies the position of the first correct answer and the subsequent reflections within the CoT. It then optimizes a trajectory-level reward function that guides reinforcement learning toward accurate yet computationally efficient reasoning. Extensive experiments on mathematical and scientific benchmarks show that RIRM reduces token consumption by up to 69.06% while improving accuracy by up to 11.37 percentage points, yielding a more favorable efficiency–accuracy tradeoff. Code will be released shortly.
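The abstract's core idea — locate the first correct answer in a trajectory and penalize reflection tokens generated after it — can be sketched as a trajectory-level reward. This is a minimal illustration, not the authors' released implementation: the function name `rirm_reward`, the argument `first_answer_idx`, and the penalty weight `alpha` are all assumptions introduced here for clarity.

```python
def rirm_reward(tokens, first_answer_idx, is_correct, alpha=0.5):
    """Hypothetical sketch of a reflection-inhibition reward.

    tokens           : list of generated CoT tokens for one trajectory
    first_answer_idx : index of the last token of the first correct answer,
                       or None if no answer was found (assumed detected by
                       an external answer-matching step)
    is_correct       : whether that first answer matches the ground truth
    alpha            : assumed weight on the post-answer-length penalty
    """
    if first_answer_idx is None or not is_correct:
        return -1.0  # incorrect trajectory: flat negative reward
    # Fraction of the trajectory spent on post-answer reflection.
    post_answer = len(tokens) - (first_answer_idx + 1)
    frac = post_answer / max(len(tokens), 1)
    # Full reward when generation stops at the first correct answer;
    # reward decreases as redundant reflection grows.
    return 1.0 - alpha * frac
```

Under this shaping, a trajectory that stops at its first correct answer earns the maximum reward, while one that keeps reflecting is discounted in proportion to the redundant suffix, pushing the policy toward shorter correct generations.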
Paper Type: Long
Research Area: Language Models
Research Area Keywords: AI/LLM Agents, Efficient/Low-Resource Methods for NLP, Generation, Language Modeling, Machine Learning for NLP
Languages Studied: English, Chinese
Submission Number: 9572