Keywords: large language models, tool learning
Abstract: Tool learning enables large language models (LLMs) to interact with real-world environments. While prior work mainly relies on supervised fine-tuning (SFT), recent reinforcement learning (RL) methods have shown promise in improving the tool-use capabilities of LLMs by leveraging richer reward signals. However, during RL rollouts, failures often stem from environmental perturbations such as network issues or tool instability rather than from policy errors. These failed trajectories are typically discarded, resulting in low data efficiency and high costs, especially when using paid tools. To address this issue, we observe that many such failures can be recovered through simple retries, reasoning, or reflection. However, the augmented policies used for self-correction introduce distribution shifts that hinder the reuse of recovered data for learning the original policy. In this paper, we propose Tool-Reflective Reinforcement Learning (Tool-ReRL), an off-policy RL framework that equips LLMs with a reflection mechanism to temporarily adjust the rollout policy, allowing the model to analyze failures, attempt self-correction, and explore diverse solution paths. To bridge the distribution gap between the modified and original policies, we introduce an importance sampling estimator that enables rewards from reflection-enhanced trajectories to effectively guide the optimization of the original policy. Our extensive experiments on four tool-learning benchmarks demonstrate that, given the same training data, Tool-ReRL significantly improves data efficiency and achieves average performance gains of up to 7.60% and 6.11% over standard RL algorithms based on Qwen2.5-7B and LLaMA3.1-8B, respectively.
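As an illustrative sketch of the kind of correction the abstract describes (the paper's exact estimator and notation are not reproduced here; $\pi_{\text{reflect}}$, $\hat{A}_t$, and the token-level factorization below are assumed for illustration), an importance-sampling-weighted policy gradient over reflection-enhanced rollouts might take the form

$$\nabla_\theta J(\theta) \;\approx\; \mathbb{E}_{\tau \sim \pi_{\text{reflect}}}\!\left[\sum_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{reflect}}(a_t \mid s_t)}\,\hat{A}_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right],$$

where $\pi_\theta$ is the original policy being optimized, $\pi_{\text{reflect}}$ is the reflection-augmented rollout policy, and $\hat{A}_t$ is an advantage estimate; the likelihood ratio reweights rewards from recovered trajectories so that the resulting updates are corrected toward the original policy's distribution.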
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3924