RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
Keywords: Embodied AI, Multimodal Large Language Model, Benchmark
Abstract: Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control.
Systematic evaluation of System 2 is thus crucial for advancing embodied intelligence.
Yet existing benchmarks either emphasize execution success or, when they do target System 2, suffer from incomplete evaluation dimensions and limited task realism, offering only a partial assessment of embodied cognition.
To bridge this gap, we introduce RoboBench, the first benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. RoboBench defines five critical dimensions—instruction comprehension, perception reasoning, generalized planning, affordance reasoning, and failure analysis—spanning 15 abilities, 26 tasks, and over 7,000 QA pairs. To ensure realism, we design task settings that span diverse embodiments (single-arm, dual-arm, and mobile manipulation), objects with rich physical and semantic attributes, and multi-view scenes with occlusion and closed-loop feedback, sourced from large-scale real-world robotic datasets and curated in-house data. For planning, RoboBench proposes a DAG-based evaluation framework that captures action–object dependencies and admissible variations in execution order, enabling a more faithful assessment of long-horizon reasoning than prior multiple-choice, BLEU, or generic LLM-based metrics.
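The abstract only names the DAG-based planning metric; as a rough illustration of the underlying idea (a minimal sketch with assumed names, not RoboBench's released code), a predicted step sequence can be scored by the fraction of ground-truth dependency edges it respects, so that every valid topological order of the DAG receives full credit while order violations are penalized:

# Minimal sketch (hypothetical names, not RoboBench's actual implementation)
# of a DAG-based plan score: a dependency edge (u, v) means step u must
# execute before step v; any valid topological order scores 1.0.
def plan_edge_score(predicted_plan: list[str],
                    dependency_edges: set[tuple[str, str]]) -> float:
    """Fraction of dependency edges honored by the predicted plan;
    a step missing from the plan counts as a violated edge."""
    pos = {step: i for i, step in enumerate(predicted_plan)}
    if not dependency_edges:
        return 1.0
    ok = sum(1 for u, v in dependency_edges
             if u in pos and v in pos and pos[u] < pos[v])
    return ok / len(dependency_edges)

# Hypothetical task: "pick cup" and "open drawer" may occur in either
# order, but both must precede "place cup in drawer".
edges = {("pick cup", "place cup in drawer"),
         ("open drawer", "place cup in drawer")}
print(plan_edge_score(["open drawer", "pick cup", "place cup in drawer"], edges))  # 1.0
print(plan_edge_score(["pick cup", "open drawer", "place cup in drawer"], edges))  # 1.0
print(plan_edge_score(["place cup in drawer", "pick cup", "open drawer"], edges))  # 0.0

Unlike BLEU or exact sequence match, such an edge-based score credits all execution orders consistent with the dependency structure, which is what distinguishes a DAG-based metric from string-similarity ones.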
Experiments on 17 state-of-the-art MLLMs reveal fundamental limitations: difficulties with implicit instruction grounding, spatiotemporal reasoning, long-horizon and cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis.
RoboBench provides a comprehensive scaffold to quantify embodied cognition, clarify System 2 performance, and guide the development of next-generation MLLMs toward more robust embodied intelligence.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 5379