Guided by Experts, Empowered by Self-Exploration: A Unified Post-Training Methodology for Agentic Tool Use

ACL ARR 2026 January Submission 5001 Authors

05 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Tool Use Learning, Reinforcement Learning, Post-Training
Abstract: Post-training is crucial for enhancing tool-use capabilities in large language models (LLMs). However, the dominant paradigms face key limitations: Supervised Fine-Tuning (SFT) exhibits poor generalization, Reinforcement Learning (RL) suffers from sparse or coarse-grained rewards, and the SFT-then-RL pipeline fails to balance supervised imitation with self-exploration. We propose $\textbf{ToolUPT}$, a $\textbf{Tool}$-Use $\textbf{U}$nified $\textbf{P}$ost-$\textbf{T}$raining methodology that integrates SFT and RL through dynamic token-wise weighting. To address the credit assignment problem in sequence-level RL formulations, we introduce $\textbf{Fine-Grained Advantage Estimation}$, which assigns token-level rewards that integrate outcome and process signals. To stabilize training and improve generalization, we propose $\textbf{Selective Preference Optimization}$, which incorporates expert supervision on curated tokens while teaching the model to recognize and avoid rejected tool use through negative advantages. Extensive experiments on BFCL-v3 across four base models show that ToolUPT significantly outperforms pure RL by $\textbf{8.43\%}$, SFT-then-RL by $\textbf{3.58\%}$, and previous methods combining expert demonstrations with self-exploration by $\textbf{4.83\%}$, achieving state-of-the-art performance in tool-use post-training.
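
The abstract's unified objective can be pictured as a single token-wise weighted loss. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes a binary expert mask marking the curated (supervised) tokens, fine-grained advantages built from a terminal outcome reward plus per-token process rewards, and mean-centering as a stand-in baseline. The function names (fine_grained_advantages, unified_loss) and all concrete formulas are hypothetical illustrations of the described idea.

import torch
import torch.nn.functional as F

def fine_grained_advantages(outcome_reward, process_rewards, gamma=1.0):
    """Token-level advantages (illustrative): returns-to-go over per-token
    process rewards, with the sequence-level outcome reward added at the
    final token, then mean-centered as a crude baseline."""
    T = process_rewards.shape[-1]
    rewards = process_rewards.clone()
    rewards[..., -1] += outcome_reward        # outcome signal lands on the last token
    returns = torch.zeros_like(rewards)
    running = torch.zeros(rewards.shape[:-1])
    for t in reversed(range(T)):              # discounted backward accumulation
        running = rewards[..., t] + gamma * running
        returns[..., t] = running
    return returns - returns.mean(dim=-1, keepdim=True)

def unified_loss(logits, target_ids, advantages, expert_mask):
    """Dynamic token-wise blend: cross-entropy imitation on expert tokens,
    advantage-weighted policy gradient (REINFORCE-style) elsewhere."""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    sft_term = -token_logp                         # supervised imitation per token
    rl_term = -advantages.detach() * token_logp    # negative advantages push away
    w = expert_mask.float()                        # 1 on curated expert tokens
    return (w * sft_term + (1 - w) * rl_term).mean()

# Toy usage: batch of 2 trajectories, 8 tokens, vocab of 100.
B, T, V = 2, 8, 100
logits = torch.randn(B, T, V, requires_grad=True)
ids = torch.randint(0, V, (B, T))
adv = fine_grained_advantages(torch.tensor([1.0, 0.0]), torch.rand(B, T))
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, :4] = True                                 # first half supervised by an expert
loss = unified_loss(logits, ids, adv, mask)
loss.backward()

Note that rejected tool calls would enter this sketch simply as tokens with negative advantages, so the RL term actively lowers their log-probability rather than merely ignoring them.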
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Reinforcement Learning, LLM/AI agents, fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 5001