Mind the Gap: Diagnosing Spatial Reasoning Failures in Vision-Language Models

ICLR 2026 Conference Submission 19475 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0

Keywords: Vision-Language Models, VLMs, spatial reasoning, relational understanding, orientation, mental rotation, visualization, synthetic tasks, real-world images, model evaluation

TL;DR: VLMs appear competent but struggle with true spatial reasoning, performing near random on our diagnostic tasks and revealing a critical weakness in current models.

Abstract: Vision-Language Models (VLMs) have captivated the research community by effectively merging visual and textual information, implying a holistic comprehension of the environment. These models find applications in tasks such as Image Captioning and Visual Question Answering, fostering the assumption that they perceive reality in a way similar to human cognition. However, this apparent understanding may be misleading. We argue that a critical component of comprehension—spatial reasoning—has been insufficiently addressed, as current benchmarks primarily test models' ability to identify object positions rather than evaluate genuine spatial logic. In this study, we aim to address this limitation. Drawing from the fundamental elements of human cognition, we developed a diagnostic framework designed to isolate the essential components of spatial reasoning: relational understanding, orientation, mental rotation, and visualization. We evaluated 17 state-of-the-art VLMs within both controlled synthetic settings and the complex variability of images captured in the real world. Results indicate a substantial gap in performance: the apparent competence of these models decreases significantly under spatial reasoning tasks that require any dynamic transformation and manipulation of spatial information. On average, their performance parallels random guessing, which highlights a major systematic weakness in spatial reasoning in current VLMs. In addition to providing evidence for this limitation, this study also provides the research community with a foundational framework for developing models that can accurately understand and reason about spatial properties in their environment.

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 19475
