Keywords: Efficient Reasoning, Reinforcement Learning, Large Language Models
Abstract: Balancing reasoning accuracy and efficiency in large language models (LLMs) is a critical objective. Reinforcement learning (RL) has emerged as a leading paradigm for achieving this goal. However, current RL-based methods cannot effectively distinguish between redundant and essential reasoning paths. Moreover, these methods often lack flexibility in handling samples of different difficulty levels. To address these limitations, we present Cognition-Guided Policy Optimization (CGPO), which consists of a Cognitive Utility Reward (CUR) and Cognition-Adaptive Regulation (CAR). Specifically, CUR is a multiplicative reward that scales the correctness reward by a non-linear length penalty to reduce redundancy. CAR adaptively adjusts the Kullback-Leibler (KL) regularization based on the real-time cognitive difficulty of each sample. Extensive experiments on nine datasets across three reasoning tasks demonstrate that CGPO achieves an effective balance between efficiency and reasoning accuracy. For instance, on mathematical reasoning benchmarks with the DeepScaleR-Preview-1.5B model, CGPO outperforms other methods by 0.2 to 3.2 points in average Pass@1 while reducing token usage by 0.7% to 38.4%.
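The abstract does not give CGPO's exact formulas, so the following is only a minimal Python sketch of the two mechanisms it describes: a multiplicative correctness-times-length-penalty reward (CUR) and a KL coefficient modulated by per-sample difficulty (CAR). The exponential penalty form, the pass-rate difficulty proxy, and all parameter names and values are assumptions for illustration, not the paper's method.

```python
import math

def cognitive_utility_reward(is_correct: bool, n_tokens: int,
                             max_tokens: int = 4096, alpha: float = 1.0) -> float:
    """Sketch of a CUR-style reward: the correctness reward is scaled
    multiplicatively by a non-linear length penalty, so shorter correct
    answers earn more. The exponential form and alpha are assumptions."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = math.exp(-alpha * n_tokens / max_tokens)  # in (0, 1]
    return correctness * length_penalty  # multiplicative, per the abstract

def adaptive_kl_coef(pass_rate: float, beta_min: float = 1e-3,
                     beta_max: float = 1e-1) -> float:
    """Sketch of a CAR-style schedule: using a sample's rollout pass rate
    as a difficulty proxy (an assumption), hard samples (low pass rate)
    get a weaker KL pull so the policy can explore, while easy samples
    stay anchored to the reference policy."""
    pass_rate = min(max(pass_rate, 0.0), 1.0)
    return beta_min + (beta_max - beta_min) * pass_rate
```

For example, a correct 1,024-token answer would score exp(-0.25) ≈ 0.78 under this sketch, while a correct 4,096-token answer would score exp(-1) ≈ 0.37, so the policy is pushed toward concise correct reasoning.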
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Mathematical reasoning
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 220