Keywords: Large Language Models, Chain of Thought, Reinforcement Learning
Abstract: Chain-of-Thought (CoT) reasoning has become a key problem-solving capability of advanced large language models (LLMs), but the "over-thinking" issue still plagues LLMs. In this paper, we propose a novel learning scheme, termed Adaptive-Budgeting based Policy Optimization (ABPO), to balance the performance and efficiency of CoT reasoning. ABPO frames RL training as an adaptive curriculum learning process, in which curated example pools categorize training examples into three types: mastered, learning, and hard. As training progresses, ABPO adaptively schedules the examples with proper length budgets, and the example pools are dynamically updated according to the model's status, thereby achieving a good balance between the efficiency and performance of CoTs. The experimental results not only show the substantial efficiency improvements brought by ABPO, e.g., reducing token length by 78.3% while improving performance by 2.0% on average for DeepSeek-R1-Distill-Qwen-1.5B, but also show its clear advantages over compared methods, e.g., reducing length by 59.4% and improving performance by 8.3% on average relative to HAPO. Our code is anonymously released at https://anonymous.4open.science/r/AnonymizeABPO-5380/
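The pool-based curriculum described in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the solve-rate thresholds, pool names' mapping, and per-pool token budgets below are all hypothetical placeholders chosen for demonstration.

```python
def categorize(solve_rate, lo=0.2, hi=0.8):
    """Assign an example to a pool from its recent solve rate.

    Thresholds `lo` and `hi` are hypothetical; the paper's actual
    criteria for mastered/learning/hard may differ.
    """
    if solve_rate >= hi:
        return "mastered"
    if solve_rate <= lo:
        return "hard"
    return "learning"


def schedule(examples, budgets=None):
    """Group (example_id, solve_rate) pairs into pools and attach a
    per-pool CoT length budget (budget values are placeholders)."""
    budgets = budgets or {"mastered": 256, "learning": 1024, "hard": 4096}
    pools = {"mastered": [], "learning": [], "hard": []}
    for ex_id, rate in examples:
        pool = categorize(rate)
        pools[pool].append((ex_id, budgets[pool]))
    return pools
```

In an RL loop, `schedule` would be re-run periodically so that examples migrate between pools as the model's solve rates change, which is the dynamic-update behavior the abstract describes.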
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 983