Keywords: Agent, Multimodal Learning, Tool Learning
Abstract: The advancement of large language models (LLMs) has prompted the development of multi-modal agents, which offer a feasible way to solve practical tasks by using tools. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To ensure data quality, we prompt the GPT-4o model to separately generate queries, files, and trajectories, followed by a query-file verifier and a trajectory verifier. Based on this data synthesis pipeline, we collect the MM-Traj dataset with 20k tasks using 10 tools. We then build the T3-agent (Trajectory Tuning for Tool usage) by tuning a MiniCPM-V controller on MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-agent achieves remarkable improvements and outperforms GPT-4-driven agents by 10%, demonstrating the effectiveness of the proposed data synthesis pipeline in yielding better reasoning capabilities for tool usage.
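Below is a minimal, illustrative sketch (not the authors' implementation) of the two-stage verification described in the abstract: a query-file verifier checks that a generated query is answerable from its associated files, and a trajectory verifier checks that a generated tool-usage trajectory actually solves the query. It assumes the OpenAI Python client; the model name, prompts, and function names are placeholders.

```python
# Hypothetical sketch of the two verifiers in the data synthesis pipeline.
# Assumes the OpenAI Python client; prompts and names are illustrative only.
from openai import OpenAI

client = OpenAI()


def _ask_yes_no(prompt: str) -> bool:
    """Ask GPT-4o a yes/no verification question and parse the answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")


def verify_query_file(query: str, file_description: str) -> bool:
    """Query-file verifier: is the query solvable given the generated files?"""
    return _ask_yes_no(
        f"Query: {query}\nFiles: {file_description}\n"
        "Can this query be answered using these files and multi-modal tools? "
        "Answer yes or no."
    )


def verify_trajectory(query: str, trajectory: str) -> bool:
    """Trajectory verifier: does the tool-usage trajectory solve the query?"""
    return _ask_yes_no(
        f"Query: {query}\nTool-usage trajectory: {trajectory}\n"
        "Does this trajectory correctly solve the query? Answer yes or no."
    )


def keep_example(query: str, file_description: str, trajectory: str) -> bool:
    """Keep a synthesized example only if it passes both verifiers."""
    return verify_query_file(query, file_description) and verify_trajectory(
        query, trajectory
    )
```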
Submission Number: 93