Keywords: Large Language Models, Chain of Thought, Reinforcement Learning
Abstract: Recently, Chain-of-Thought (CoT) reasoning has become a key problem-solving
capability for advanced large language models (LLMs) to address difficult tasks
such as mathematical ones. However, balancing the efficiency and performance of long CoTs remains an open challenge. In this paper, we observe that assigning adaptive token budgets to different examples during training
is a viable way to tackle this issue. Motivated by this, we propose a novel reinforcement learning scheme, termed Adaptive-Budgeting based
Policy Optimization (ABPO). Built upon the popular GRPO, ABPO redefines
RL training as an adaptive curriculum learning process, in which example pools
are curated to categorize training examples into three types: mastered,
learning, and hard. As training progresses, ABPO adaptively schedules examples with appropriate length budgets, and the example pools
are dynamically updated according to the model's status. In this way, adaptive token budgets are
assigned to different examples during RL training, striking
a good balance between the efficiency and performance of CoTs. To validate ABPO,
we apply it to three representative LLMs and conduct extensive experiments on
a range of CoT reasoning benchmarks. The experimental results not only demonstrate
substantial efficiency gains with minimal performance loss, e.g., reducing token length by 78.3% while improving performance by 2.0% on average for DeepSeek-R1-Distill-Qwen-1.5B, but also show clear advantages over
the compared methods, e.g., reducing length by 59.4% and improving performance by 8.3% on average relative to HAPO. Our code is anonymously released at https://anonymous.4open.science/r/AnonymizeABPO-5380/
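To illustrate the pool-based budgeting idea described in the abstract, a minimal Python sketch follows. The three pool names come from the abstract, but the accuracy thresholds, budget values, and helper names (categorize, BUDGETS) are illustrative assumptions rather than the paper's actual hyperparameters or implementation (see the released code for the real method).

    # Minimal sketch of the pool-based budgeting idea described above.
    # Thresholds, budget values, and helper names are illustrative assumptions,
    # not the paper's actual hyperparameters or implementation.
    from dataclasses import dataclass, field

    # Hypothetical per-pool generation-length budgets (tokens).
    BUDGETS = {"mastered": 512, "learning": 2048, "hard": 8192}

    @dataclass
    class ExamplePools:
        mastered: list = field(default_factory=list)  # reliably solved -> tight budget
        learning: list = field(default_factory=list)  # partially solved -> moderate budget
        hard: list = field(default_factory=list)      # rarely solved -> generous budget

    def categorize(examples, accuracy, high=0.9, low=0.2):
        """Assign each example to a pool from its rollout accuracy (e.g., GRPO group accuracy)."""
        pools = ExamplePools()
        for ex in examples:
            acc = accuracy[ex]
            if acc >= high:
                pools.mastered.append(ex)
            elif acc <= low:
                pools.hard.append(ex)
            else:
                pools.learning.append(ex)
        return pools

    # Toy usage: rollout accuracies would normally be re-estimated during training,
    # so the pools get refreshed as the model improves.
    accuracy = {"q1": 0.95, "q2": 0.50, "q3": 0.10}
    pools = categorize(accuracy.keys(), accuracy)
    for name in ("mastered", "learning", "hard"):
        print(name, getattr(pools, name), "budget:", BUDGETS[name])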
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5814