Keywords: embodied reasoning, multimodal large language models
TL;DR: The first unified benchmark for evaluating the embodied reasoning abilities of MLLMs
Abstract: Embodied reasoning abilities refer to the capabilities of agents to perceive, comprehend, and interact effectively with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied reasoning capabilities remains underexplored, as existing benchmarks primarily focus on isolated domains such as planning or spatial understanding. To bridge this gap, we propose BEAR, a comprehensive and fine-grained benchmark designed to evaluate the atomic embodied reasoning abilities of MLLMs. BEAR comprises 4,469 interleaved video–image–text entries across 14 domains in 6 categories, spanning tasks that range from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Evaluation results for 15 state-of-the-art MLLMs reveal persistent limitations across all domains of embodied reasoning.
Submission Number: 179