Abstract: Multimodal Large Language Models (MLLMs) have made significant advancements, offering a promising future for embodied agents.
Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios.
Meanwhile, existing embodied benchmarks are task-specific and insufficiently diverse, and thus do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a challenging benchmark for evaluating the interactive capabilities of MLLMs in embodied tasks. EmbodiedEval provides a unified simulation and evaluation framework tailored for MLLMs. Through rigorous selection and annotation, EmbodiedEval comprises 328 distinct tasks across five categories, set in 125 varied 3D scenes. We evaluate state-of-the-art MLLMs on EmbodiedEval and find that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, benchmarking
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5584