Keywords: Multimodal Agents, Vision-language Model, Tool usage
TL;DR: MATRIX is a vision-centric agent tuning framework that combines large-scale multimodal data synthesis with step-wise preference optimization, outperforming strong open- and closed-source baselines by up to 23% on Agent-X, GTA, and GAIA.
Abstract: Vision-language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation.
We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-Trace, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller fine-tuned on M-Trace for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks (Agent-X, GTA, and GAIA), MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our datasets and models will be open-sourced to support future research.
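For intuition, the step-wise preference learning over Pref-X can be read as a per-step DPO-style objective; the abstract does not specify the exact loss, so the formulation below, including the policy $\pi_\theta$, frozen reference policy $\pi_{\mathrm{ref}}$, temperature $\beta$, and the notation $(s_t, a_t^{+}, a_t^{-})$ for a step context with preferred and dispreferred tool-use actions, is an illustrative assumption rather than the authors' stated method:

$$
\mathcal{L}_{\text{step}}(\theta) \;=\; -\,\mathbb{E}_{(s_t,\, a_t^{+},\, a_t^{-}) \sim \text{Pref-X}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(a_t^{+} \mid s_t)}{\pi_{\mathrm{ref}}(a_t^{+} \mid s_t)}
\;-\;
\beta \log \frac{\pi_\theta(a_t^{-} \mid s_t)}{\pi_{\mathrm{ref}}(a_t^{-} \mid s_t)}
\right)\right]
$$

Under this reading, each preference pair supervises a single tool-use step conditioned on the trajectory so far, rather than ranking whole trajectories.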
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11127