Keywords: Large Language Models, Chain of Thought, Reinforcement Learning
Abstract: Chain-of-Thought (CoT) reasoning has become a key problem-solving capability of advanced large language models (LLMs), but the "over-thinking" issue still plagues LLMs. In this paper, we propose a novel learning scheme, termed Adaptive-Budgeting based Policy Optimization (ABPO), to balance the performance and efficiency of CoT reasoning. ABPO frames RL training as an adaptive curriculum learning process, in which curated example pools categorize training examples into three types: mastered, learning, and hard. As training progresses, ABPO adaptively schedules the examples with proper length budgets, and the example pools are dynamically updated according to the model's status, thereby achieving a good balance between the efficiency and performance of CoTs. The experimental results not only show the substantial efficiency improvements brought by ABPO, e.g., reducing token length by 78.3% while improving performance by 2.0% on average for DeepSeek-R1-Distill-Qwen-1.5B, but also show its clear advantages over compared methods, e.g., reducing length by 59.4% and improving performance by 8.3% on average relative to HAPO. Our code is anonymously released at https://anonymous.4open.science/r/AnonymizeABPO-5380/
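The pool-based curriculum described in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the solve-rate thresholds, pool names' mapping, and per-pool token budgets below are all hypothetical placeholders chosen for demonstration.

```python
def categorize(solve_rate, lo=0.2, hi=0.8):
    """Assign an example to a pool from its recent solve rate.

    Thresholds `lo` and `hi` are hypothetical; the paper's actual
    criteria for mastered/learning/hard may differ.
    """
    if solve_rate >= hi:
        return "mastered"
    if solve_rate <= lo:
        return "hard"
    return "learning"


def schedule(examples, budgets=None):
    """Group (example_id, solve_rate) pairs into pools and attach a
    per-pool CoT length budget (budget values are placeholders)."""
    budgets = budgets or {"mastered": 256, "learning": 1024, "hard": 4096}
    pools = {"mastered": [], "learning": [], "hard": []}
    for ex_id, rate in examples:
        pool = categorize(rate)
        pools[pool].append((ex_id, budgets[pool]))
    return pools
```

In an RL loop, `schedule` would be re-run periodically so that examples migrate between pools as the model's solve rates change, which is the dynamic-update behavior the abstract describes.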
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 983