Keywords: Efficient Reasoning, Reinforcement Learning, Large Language Models
Abstract: Balancing reasoning accuracy and efficiency in large language models (LLMs) is a critical objective. Reinforcement learning (RL) has emerged as a leading paradigm for achieving this goal. However, current RL-based methods cannot effectively distinguish between redundant and essential reasoning paths. Moreover, these methods often lack flexibility in handling samples of different difficulty levels. To address these limitations, we present Cognition-Guided Policy Optimization (CGPO), which consists of a Cognitive Utility Reward (CUR) and Cognition-Adaptive Regulation (CAR). Specifically, CUR is a multiplicative reward that scales the correctness reward by a non-linear length penalty to reduce redundancy. CAR adaptively adjusts the Kullback-Leibler (KL) regularization based on the real-time cognitive difficulty of each sample. Extensive experiments on nine datasets across three reasoning tasks demonstrate that CGPO achieves an effective balance between efficiency and reasoning accuracy. For instance, on mathematical reasoning benchmarks with the DeepScaleR-Preview-1.5B model, CGPO outperforms other methods by 0.2 to 3.2 points in average Pass@1 while reducing token usage by 0.7% to 38.4%.
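The abstract does not give CGPO's exact formulas, so the following is only a minimal Python sketch of the two mechanisms it describes: a multiplicative correctness-times-length-penalty reward (CUR) and a KL coefficient modulated by per-sample difficulty (CAR). The exponential penalty form, the pass-rate difficulty proxy, and all parameter names and values are assumptions for illustration, not the paper's method.

```python
import math

def cognitive_utility_reward(is_correct: bool, n_tokens: int,
                             max_tokens: int = 4096, alpha: float = 1.0) -> float:
    """Sketch of a CUR-style reward: the correctness reward is scaled
    multiplicatively by a non-linear length penalty, so shorter correct
    answers earn more. The exponential form and alpha are assumptions."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = math.exp(-alpha * n_tokens / max_tokens)  # in (0, 1]
    return correctness * length_penalty  # multiplicative, per the abstract

def adaptive_kl_coef(pass_rate: float, beta_min: float = 1e-3,
                     beta_max: float = 1e-1) -> float:
    """Sketch of a CAR-style schedule: using a sample's rollout pass rate
    as a difficulty proxy (an assumption), hard samples (low pass rate)
    get a weaker KL pull so the policy can explore, while easy samples
    stay anchored to the reference policy."""
    pass_rate = min(max(pass_rate, 0.0), 1.0)
    return beta_min + (beta_max - beta_min) * pass_rate
```

For example, a correct 1,024-token answer would score exp(-0.25) ≈ 0.78 under this sketch, while a correct 4,096-token answer would score exp(-1) ≈ 0.37, so the policy is pushed toward concise correct reasoning.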
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Mathematical reasoning
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 220