Abstract: Recently, agents based on Multimodal Large Language Models (MLLMs) have emerged as a promising area of research. However, there is a notable absence of effective benchmarks for training and evaluating these MLLM-based agents. In this paper, we propose a new benchmark for MLLMs named PCA-Bench, comprising 1) PCA-Eval, a novel automatic evaluation metric inspired by the perception-action loop in cognitive science, which assesses the decision-making ability of MLLMs from the perspectives of Perception, Cognition, and Action; 2) Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in various multi-modal embodied environments; and 3) a dataset comprising 7,510 training and 813 test examples across three domains: autonomous driving, domestic robotics, and open-world gaming. Our experiments demonstrate that visual perception and reasoning with world knowledge are two core abilities an agent needs to make correct decisions. Advanced MLLMs such as GPT-4 Vision outperform their open-source counterparts. Additionally, our EIE method substantially enhances the performance of open-source MLLMs, at times even surpassing GPT-4 Vision on certain sub-scores. We believe PCA-Bench serves as an effective bridge between MLLMs and their application in embodied agents. The benchmark will be made open-source.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Data resources
Languages Studied: English