VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Xiao Liu; Tianjie Zhang; Yu Gu; Iat Long Iong; Song XiXuan; Yifan Xu; Shudan Zhang; Hanyu Lai; Jiadai Sun; Xinyue Yang; Yu Yang; Zehan Qi; Shuntian Yao; Xueqiao Sun; Siyi Cheng; Qinkai Zheng; Hao Yu; Hanchen Zhang; Wenyi Hong; Ming Ding; Lihang Pan; Xiaotao Gu; Aohan Zeng; Zhengxiao Du; Chan Hee Song; Yu Su; Yuxiao Dong; Jie Tang

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Published: 22 Jan 2025, Last Modified: 03 Mar 2025ICLR 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Multimodal Models, Agents, Evaluation

Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable \textbf{Visual Foundation Agents} that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs as visual foundation agents in complex, real-world environments. To address this gap, we introduce VisualAgentBench (VAB), a comprehensive and unified benchmark specifically designed to train and evaluate LMMs as visual foundation agents across diverse scenarios in one standard setting, including Embodied, Graphical User Interface, and Visual Design, with tasks formulated to probe the depth of LMMs' understanding and interaction capabilities. Through rigorous testing across 9 proprietary LMM APIs and 9 open models (18 in total), we demonstrate the considerable yet still developing visual agent capabilities of these models. Additionally, VAB explores the synthesizing of visual agent trajectory data through hybrid methods including Program-based Solvers, LMM Agent Bootstrapping, and Human Demonstrations, offering insights into obstacles, solutions, and trade-offs one may meet in developing open LMM agents. Our work not only aims to benchmark existing models but also provides an instrumental playground for future development into visual foundation agents. Code, train, and test data are available at \url{https://github.com/THUDM/VisualAgentBench}.

Primary Area: datasets and benchmarks

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 6640

Loading

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding et al. (8 additional authors not shown)