Keywords: Multimodal Agents, Vision-language Model, Tool usage
TL;DR: MATRIX is a vision-centric agent tuning framework that combines large-scale multimodal data synthesis with step-wise preference optimization, outperforming strong open- and closed-source baselines by up to 23% on Agent-X, GTA, and GAIA.
Abstract: Vision-language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation.
We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-Trace, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller fine-tuned on M-Trace for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks (Agent-X, GTA, and GAIA), MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our datasets and models will be open-sourced to support future research.
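For intuition, the step-wise preference learning over Pref-X can be read as a per-step DPO-style objective; the abstract does not specify the exact loss, so the formulation below, including the policy $\pi_\theta$, frozen reference policy $\pi_{\mathrm{ref}}$, temperature $\beta$, and the notation $(s_t, a_t^{+}, a_t^{-})$ for a step context with preferred and dispreferred tool-use actions, is an illustrative assumption rather than the authors' stated method:

$$
\mathcal{L}_{\text{step}}(\theta) \;=\; -\,\mathbb{E}_{(s_t,\, a_t^{+},\, a_t^{-}) \sim \text{Pref-X}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(a_t^{+} \mid s_t)}{\pi_{\mathrm{ref}}(a_t^{+} \mid s_t)}
\;-\;
\beta \log \frac{\pi_\theta(a_t^{-} \mid s_t)}{\pi_{\mathrm{ref}}(a_t^{-} \mid s_t)}
\right)\right]
$$

Under this reading, each preference pair supervises a single tool-use step conditioned on the trajectory so far, rather than ranking whole trajectories.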
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11127