Do Vision Language Models Rotate in Mind? Evaluating Spatial Transformation Reasoning

20 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: vision-language models, spatial reasoning, multimodal reasoning
TL;DR: We evaluate the spatial reasoning capabilities of VLMs and LLMs by introducing a benchmark that assesses their understanding of geometric transformations, such as rotation and translation, across various tasks.
Abstract: Vision-language models (VLMs) have achieved impressive performance across diverse tasks, yet their ability to mentally perform spatial transformations—rotating, translating, and manipulating objects—remains poorly understood. We present \textbf{TransformEval}, a systematic benchmark that evaluates spatial transformation reasoning across 2D shapes, 3D mental rotation, and multi-object scenes, uniquely distinguishing between \textit{state prediction} (determining outcomes after transformations) and \textit{transformation inference} (recovering operations from state changes). Our evaluation of state-of-the-art VLMs reveals fundamental limitations: models struggle with basic transformations that humans solve effortlessly, consistently perform better at inferring transformations than predicting their outcomes—opposite to human cognitive patterns—and frequently fail at translation operations and transformation ordering. These findings suggest that VLMs rely on pattern matching rather than mental simulation, lacking the spatial reasoning capabilities necessary for applications in robotics, augmented reality, and other domains requiring genuine spatial intelligence. Our benchmark provides a framework for measuring progress toward human-like spatial understanding in vision-language models.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23377
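To make the abstract's distinction between the two task formats concrete, the following is a minimal sketch of how a state-prediction query and a transformation-inference query could be built from the same underlying transformation. This is not the TransformEval release: the grid shape, the transformation vocabulary, and the prompt wording are all illustrative assumptions.

```python
import numpy as np

# Hypothetical 2D shape on a small binary grid (1 = filled cell).
shape = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
])

def apply_transform(grid, op):
    """Apply a named transformation to a binary grid (illustrative vocabulary)."""
    if op == "rotate_90_cw":
        return np.rot90(grid, k=-1)            # clockwise quarter turn
    if op == "rotate_90_ccw":
        return np.rot90(grid, k=1)             # counter-clockwise quarter turn
    if op == "translate_right":
        return np.roll(grid, shift=1, axis=1)  # shift one cell right (wrapping)
    raise ValueError(f"unknown op: {op}")

op = "rotate_90_cw"
result = apply_transform(shape, op)

# State prediction: given the start state and the operation, predict the end state.
state_prediction_query = {
    "input": shape,
    "operation": op,
    "question": "What does the shape look like after this transformation?",
    "answer": result,
}

# Transformation inference: given start and end states, recover the operation.
transformation_inference_query = {
    "before": shape,
    "after": result,
    "question": "Which transformation maps the first shape to the second?",
    "answer": op,
}
```

Because both query types are derived from the same (state, operation, state) triple, a benchmark constructed this way can directly compare a model's forward prediction against its inverse inference, which is the comparison the abstract reports VLMs resolve in the opposite direction to humans.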