Abstract: Training tool-augmented LLMs has emerged as a promising approach to enhancing their capabilities on complex tasks.
The prevailing supervised fine-tuning (SFT) paradigm relies on constructing extensive domain-specific datasets.
However, this approach often generalizes poorly to unfamiliar or intricate tool-use scenarios.
Recently, the reinforcement learning (RL) paradigm has been shown to endow LLMs with superior reasoning and generalization abilities.
In this work, we address a key question: can pure RL effectively elicit a model's intrinsic reasoning capabilities and enhance tool-agnostic generalization?
We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns (a minimal sketch follows the abstract).
Based on this design, we introduce the Tool-Zero series of models, which learn to use general tools autonomously by scaling up RL directly from Zero models (i.e., base models without post-training).
Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings.
These gains hold consistently across both cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our method.
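The abstract does not specify the exact reward formula, so the following is only a minimal, hypothetical Python sketch of one plausible reading of the dynamic generalization-guided reward: a rule-based signal that anneals linearly from exploratory tool-use rewards (well-formed, executable tool calls) toward exploitative ones (correct final answers). All signal names, weights, and the linear schedule here are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of a reward that anneals from exploratory tool-use
# signals toward exploitative task-success signals over training.
# Signal names, weights, and the linear schedule are assumptions.

def generalization_guided_reward(
    step: int,
    total_steps: int,
    valid_tool_call: bool,   # rollout emitted a well-formed tool call
    tool_executed: bool,     # the call ran without error
    answer_correct: bool,    # rule-based check of the final answer
) -> float:
    """Blend an exploration-shaped and an exploitation-shaped rule-based reward."""
    progress = min(step / max(total_steps, 1), 1.0)  # 0 -> 1 over training

    # Early reward: encourage exploring tool-use patterns at all.
    explore = 0.5 * float(valid_tool_call) + 0.5 * float(tool_executed)

    # Late reward: only task success counts.
    exploit = float(answer_correct)

    # Linear annealing from exploratory to exploitative signal.
    return (1.0 - progress) * explore + progress * exploit


if __name__ == "__main__":
    # Early in training, a well-formed but wrong tool call still earns reward...
    print(generalization_guided_reward(100, 10_000, True, True, False))    # 0.99
    # ...late in training, reward comes almost entirely from correctness.
    print(generalization_guided_reward(9_900, 10_000, True, True, False))  # 0.01
```

Under this reading, an early rollout is rewarded merely for producing runnable tool calls, while a late rollout earns reward only for solving the task, matching the abstract's exploratory-to-exploitative shift.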
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: reinforcement learning, tool learning, LLM
Contribution Types: NLP engineering experiment, Reproduction study
Languages Studied: English
Keywords: LLM, tool use, preference learning
Submission Number: 4026