Abstract: Multimodal Large Language Models (MLLMs) have made significant advancements, offering a promising future for embodied agents.
Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios.
Meanwhile, existing embodied benchmarks are task-specific and insufficiently diverse, and thus do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a challenging benchmark for evaluating the interactive capabilities of MLLMs in embodied tasks. EmbodiedEval provides a unified simulation and evaluation framework tailored for MLLMs. Through rigorous selection and annotation, EmbodiedEval comprises 328 distinct tasks across five categories, set in 125 varied 3D scenes. We evaluate state-of-the-art MLLMs on EmbodiedEval and find that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, benchmarking
Contribution Types: Data resources
Languages Studied: English
Submission Number: 5584