Keywords: Interactive Tool-Use Agents, Multimodal Agents, Tool-Integrated Reinforcement Learning
Abstract: Effective interactive tool use requires agents to master Tool-Integrated Reasoning: a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multimodal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports tool calling and speech-based user simulation. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines. Moreover, we demonstrate our framework's suitability for fine-tuning a multimodal LLM for agentic tasks. By training a base multimodal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
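The turn-level credit assignment described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the judge scores, the blending weight, and the discounting scheme are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of TARL-style turn-level credit assignment:
# blend per-turn scores from an LLM judge with a discounted terminal
# task reward. All names and weights below are illustrative assumptions.
from typing import List


def turn_level_returns(
    judge_scores: List[float],  # per-turn scores from an LLM judge, in [0, 1]
    task_reward: float,         # terminal reward: 1.0 if the task passed, else 0.0
    gamma: float = 0.95,        # discount factor applied to the terminal reward
    judge_weight: float = 0.5,  # mixing weight between judge and task signals
) -> List[float]:
    """Return one blended reward per turn for policy-gradient training."""
    num_turns = len(judge_scores)
    blended = []
    for t, score in enumerate(judge_scores):
        # Terminal task reward discounted back to turn t.
        discounted_task = (gamma ** (num_turns - 1 - t)) * task_reward
        blended.append(judge_weight * score + (1 - judge_weight) * discounted_task)
    return blended
```

Under this sketch, early turns that the judge rates highly still receive signal even when the overall task fails, which is the credit-assignment benefit the abstract attributes to turn-level adjudication.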
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8090