Keywords: LLM, RL, tool use, auto-think
Abstract: Tool use represents a critical capability for AI agents, with recent advances focusing on leveraging reinforcement learning (RL) for test-time scaling to achieve better performance through more deliberate reasoning.
However, current RL-based scaling approaches face two key challenges:
(a) direct RL training often struggles to scale up thinking length sufficiently to solve complex problems,
and (b) scaled-up models tend to overthink simpler problems, resulting in substantial token inefficiency.
To address these challenges, we propose a novel training paradigm that first employs warm-up supervised fine-tuning to help models distinguish between simple and complex problems, followed by RL that enables models to automatically determine appropriate reasoning trajectories.
Furthermore, to tackle automatic thinking-length scaling, we find that entropy-based optimization objectives effectively maintain model diversity while unlocking the model's scaling capability.
Based on this insight, we introduce an entropy-based long-short reasoning fusion RL strategy.
Our experiments on three benchmarks demonstrate that the model successfully achieves auto-scaling for efficient tool use, yielding a significant 9.8\% accuracy improvement while reducing computational overhead by approximately 81\%.
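Illustrative note (not part of the submission's stated method): the abstract does not give the exact form of the entropy-based objective; a minimal sketch, assuming a generic entropy-regularized policy objective with an assumed coefficient \beta, reward R, and policy \pi_\theta, is

% Generic entropy-regularized RL objective; all symbols here are assumptions for illustration.
\[
  \mathcal{J}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
  + \beta \, \mathbb{E}_{s \sim \pi_\theta}\big[ \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) \big],
\]

where the entropy term \mathcal{H} encourages the policy to retain diverse (long and short) reasoning trajectories rather than collapsing to a single response length.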
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 12286