Keywords: Multimodal Large Language Models, Evaluation, Spatial Reasoning
Abstract: In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLMs show impressive reasoning performance, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained human expert annotations of both perceptual complexity and the reasoning process. These annotations allow us to move beyond aggregate accuracy, enabling an instance-level, parallel analysis of human and machine cognitive profiles with predictive power. Through extensive experiments across 14 MLLMs and a human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. We observe both convergence and divergence: while cognitive effort, measured by response time for humans and tokens generated for MLLMs, correlates with reasoning complexity in both cases, the underlying mechanisms differ. Human correctness is highly predictable and shaped by abstract pattern complexity, whereas instance-level MLLM performance remains weakly predictable and sensitive to low-level perceptual features. Our work provides a precise characterization of the emerging yet brittle spatial reasoning in MLLMs, offering actionable insights for developing more human-like spatial intelligence.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4468