EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
TL;DR: We introduce EmbodiedBench, a benchmark designed to evaluate the fine-grained capabilities of vision-driven embodied agents.
Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents.
EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning.
Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at [https://embodiedbench.github.io](https://embodiedbench.github.io).
Lay Summary: EmbodiedBench is a new tool that helps researchers test how well advanced AI systems—ones that can understand both pictures and language—perform tasks in virtual environments. These AI systems, called Multimodal Large Language Models (MLLMs), are developed to help robots and digital assistants better understand and interact with the world around them.
The benchmark includes over 1,100 tasks in different simulated settings. These tasks range from moving through a room to more complex ones like organizing household items. They test important abilities such as following instructions, recognizing objects, making plans, and adjusting to changes. When researchers tested 24 top AI models, they found that while many models are good at big-picture planning, they still struggle with detailed, hands-on actions. For example, even the most advanced model, GPT-4o, completed tasks successfully only 28.9% of the time on average. EmbodiedBench provides a standard way to compare different AI systems and understand where they need to improve. By pointing out these weaknesses, it helps guide the development of smarter, more reliable AI agents in the future.
Link To Code: https://github.com/EmbodiedBench/EmbodiedBench
Primary Area: Deep Learning->Large Language Models
Keywords: Embodied Agent, Multi-modal Large Language Models
Submission Number: 13392