Keywords: multimodality, benchmarking
Abstract: Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied benchmarks are task-specific and insufficiently diverse, and thus do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a challenging benchmark for evaluating MLLMs' interactive capabilities in embodied tasks. EmbodiedEval provides a unified simulation and evaluation framework tailored for MLLMs. Through rigorous selection and annotation, EmbodiedEval features 328 distinct tasks in five categories across 125 varied 3D scenes. We evaluate state-of-the-art MLLMs on EmbodiedEval and find that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11