Keywords: spatial visualization, spatial cognition, spatial reasoning, VLMs
Abstract: Spatial visualization is the mental ability to imagine, transform, and manipulate the spatial characteristics of objects and actions. This ability is a core component of human cognition, linking action and perception at the mental level. Do state-of-the-art Vision-Language Models (VLMs) also exhibit it? To explore this question, we develop MentalBlackboard, an open-ended spatial visualization benchmark built on Paper Folding and Hole Punching tests and organized around two core tasks: prediction and planning. Our prediction experiments reveal that models tend to overpredict the final number of holes and struggle to apply symmetrical transformations, even when they correctly predict the sequence of unfolding steps. Backward folds (folding the paper away from the camera/observer), which occlude part of the paper, reduce the accuracy of constructing the spatial arrangement. Rotations, which alter the orientation of the unfolding actions, make it significantly harder for models to track the physical orientation of the paper. The planning task, in which models must identify the sequence of folds that produces a given final hole pattern, exposes their limitations in analyzing symmetry relations and composing multi-stage symmetric transformations. In the generalization task, which does not require spatial visualization, models reason through visual analogies involving two examples of the same paper-folding process, together with a distinct spatial property and text-based hole information. Although the best-performing model, o3, reaches a peak of 71.6\% at transferring spatial information, it obtains only 25\% accuracy on text-based prediction tasks, and Claude Opus 4.1 achieves the highest planning score at just 10\%. Field-wise analysis shows that models struggle most with locating and orienting the holes.
Primary Area: datasets and benchmarks
Submission Number: 21900