Keywords: Large Language Models, Chain of Thought, Reinforcement Learning
Abstract: Recently, Chain-of-Thought (CoT) reasoning has become a key problem-solving
capability for advanced large language models (LLMs) to address difficult tasks
such as mathematical ones. However, balancing the efficiency and performance of long CoTs remains an open challenge. In this paper, we observe that assigning adaptive token budgets to different examples during training
is a viable way to tackle this issue. Motivated by this, we propose a novel reinforcement learning scheme, termed Adaptive-Budgeting based
Policy Optimization (ABPO). Built upon the popular GRPO, ABPO redefines
RL training as an adaptive curriculum learning process, in which example pools
are curated to categorize training examples into three types: mastered,
learning, and hard. As training progresses, ABPO adaptively schedules examples with appropriate length budgets, and the example pools
are dynamically updated according to the model's status. In this way, adaptive token budgets are
assigned to different examples during RL training, striking
a good balance between the efficiency and performance of CoTs. To validate ABPO,
we apply it to three representative LLMs and conduct extensive experiments on
a range of CoT reasoning benchmarks. The experimental results not only demonstrate
substantial efficiency gains with minimal performance loss, e.g., reducing token length by 78.3% while improving performance by 2.0% on average for DeepSeek-R1-Distill-Qwen-1.5B, but also show clear advantages over
the compared methods, e.g., reducing length by 59.4% and improving performance by 8.3% on average relative to HAPO. Our code is anonymously released at https://anonymous.4open.science/r/AnonymizeABPO-5380/
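To illustrate the pool-based budgeting idea described in the abstract, a minimal Python sketch follows. The three pool names come from the abstract, but the accuracy thresholds, budget values, and helper names (categorize, BUDGETS) are illustrative assumptions rather than the paper's actual hyperparameters or implementation (see the released code for the real method).

    # Minimal sketch of the pool-based budgeting idea described above.
    # Thresholds, budget values, and helper names are illustrative assumptions,
    # not the paper's actual hyperparameters or implementation.
    from dataclasses import dataclass, field

    # Hypothetical per-pool generation-length budgets (tokens).
    BUDGETS = {"mastered": 512, "learning": 2048, "hard": 8192}

    @dataclass
    class ExamplePools:
        mastered: list = field(default_factory=list)  # reliably solved -> tight budget
        learning: list = field(default_factory=list)  # partially solved -> moderate budget
        hard: list = field(default_factory=list)      # rarely solved -> generous budget

    def categorize(examples, accuracy, high=0.9, low=0.2):
        """Assign each example to a pool from its rollout accuracy (e.g., GRPO group accuracy)."""
        pools = ExamplePools()
        for ex in examples:
            acc = accuracy[ex]
            if acc >= high:
                pools.mastered.append(ex)
            elif acc <= low:
                pools.hard.append(ex)
            else:
                pools.learning.append(ex)
        return pools

    # Toy usage: rollout accuracies would normally be re-estimated during training,
    # so the pools get refreshed as the model improves.
    accuracy = {"q1": 0.95, "q2": 0.50, "q3": 0.10}
    pools = categorize(accuracy.keys(), accuracy)
    for name in ("mastered", "learning", "hard"):
        print(name, getattr(pools, name), "budget:", BUDGETS[name])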
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5814