RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
Keywords: Embodied AI, Multimodal Large Language Model, Benchmark
Abstract: Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control.
Systematic evaluation of System 2 is thus crucial for advancing embodied intelligence.
Yet existing benchmarks either emphasize execution success or, when they do target System 2, suffer from incomplete evaluation dimensions and limited task realism, offering only a partial assessment of embodied cognition.
To bridge this gap, we introduce RoboBench, the first benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. RoboBench defines five critical dimensions—instruction comprehension, perception reasoning, generalized planning, affordance reasoning, and failure analysis—spanning 15 abilities, 26 tasks, and over 7,000 QA pairs. To ensure realism, we design task settings that span diverse embodiments (single-arm, dual-arm, and mobile manipulation), objects with rich physical and semantic attributes, and multi-view scenes with occlusion and closed-loop feedback, sourced from large-scale real-world robotic datasets and curated in-house data. For planning, RoboBench proposes a DAG-based evaluation framework that captures action–object dependencies and admissible variations in execution order, enabling a more faithful assessment of long-horizon reasoning than prior multiple-choice, BLEU, or generic LLM-based metrics.
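The abstract only names the DAG-based planning metric; as a rough illustration of the underlying idea (a minimal sketch with assumed names, not RoboBench's released code), a predicted step sequence can be scored by the fraction of ground-truth dependency edges it respects, so that every valid topological order of the DAG receives full credit while order violations are penalized:

# Minimal sketch (hypothetical names, not RoboBench's actual implementation)
# of a DAG-based plan score: a dependency edge (u, v) means step u must
# execute before step v; any valid topological order scores 1.0.
def plan_edge_score(predicted_plan: list[str],
                    dependency_edges: set[tuple[str, str]]) -> float:
    """Fraction of dependency edges honored by the predicted plan;
    a step missing from the plan counts as a violated edge."""
    pos = {step: i for i, step in enumerate(predicted_plan)}
    if not dependency_edges:
        return 1.0
    ok = sum(1 for u, v in dependency_edges
             if u in pos and v in pos and pos[u] < pos[v])
    return ok / len(dependency_edges)

# Hypothetical task: "pick cup" and "open drawer" may occur in either
# order, but both must precede "place cup in drawer".
edges = {("pick cup", "place cup in drawer"),
         ("open drawer", "place cup in drawer")}
print(plan_edge_score(["open drawer", "pick cup", "place cup in drawer"], edges))  # 1.0
print(plan_edge_score(["pick cup", "open drawer", "place cup in drawer"], edges))  # 1.0
print(plan_edge_score(["place cup in drawer", "pick cup", "open drawer"], edges))  # 0.0

Unlike BLEU or exact sequence match, such an edge-based score credits all execution orders consistent with the dependency structure, which is what distinguishes a DAG-based metric from string-similarity ones.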
Experiments on 17 state-of-the-art MLLMs reveal fundamental limitations: difficulties with implicit instruction grounding, spatiotemporal reasoning, long-horizon and cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis.
RoboBench provides a comprehensive scaffold to quantify embodied cognition, clarify System 2 performance, and guide the development of next-generation MLLMs toward more robust embodied intelligence.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 5379