Keywords: Vision-Language Models, VLMs, spatial reasoning, relational understanding, orientation, mental rotation, visualization, synthetic tasks, real-world images, model evaluation
TL;DR: VLMs appear competent but struggle with true spatial reasoning, performing near random on our diagnostic tasks and revealing a critical weakness in current models.
Abstract: Vision-Language Models (VLMs) have captivated the research community by effectively merging visual and textual information, implying a holistic comprehension of the environment. These models are applied to tasks such as Image Captioning and Visual Question Answering, fostering the assumption that they perceive reality in a way similar to human cognition. However, this apparent understanding may be misleading. We argue that a critical component of comprehension, spatial reasoning, has been insufficiently addressed: current benchmarks often conflate visual recognition with spatial reasoning, or focus on static properties rather than the dynamic simulation required for genuine spatial logic. In this study, we address this limitation through a targeted diagnostic approach. Drawing on fundamental elements of human cognition, we developed a curated evaluation suite designed to isolate the essential components of spatial reasoning: relational understanding, orientation, mental rotation, and visualization. We evaluated 17 state-of-the-art VLMs on a strictly controlled set of 1800 samples, split between synthetic settings and real-world images. The results indicate a substantial performance gap: the apparent competence of these models declines significantly on spatial reasoning tasks that require any dynamic transformation and manipulation of spatial information. On average, their performance parallels random guessing, which highlights a major systematic weakness in the spatial reasoning of current VLMs. Beyond documenting this limitation, this study offers the research community a foundational diagnostic framework for probing how models reason about spatial properties of their environment.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 19475