Keywords: Spatial Mental Model, Limited Views, Partial Observation, Spatial Reasoning, Cognitive Map
TL;DR: We propose the MindCube benchmark and find that existing VLMs perform poorly on it. Supervising models to first generate cognitive maps and then reason over them proves to be an effective approximation of spatial mental modeling from limited views.
Abstract: Humans intuitively construct mental models of space beyond what they directly perceive, but can large vision-language models (VLMs) do the same from partial observations such as **limited views**? We identify a significant gap in current VLMs via our new **MindCube** benchmark, comprising $21,154$ questions over $3,268$ images, which evaluates how well VLMs build robust spatial mental models representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of "what-if" movements) in order to reason about **unseen** space beyond immediate perception.
We explore three approaches to approximating spatial mental models in VLMs:
(1) View interpolation, which visualizes mental simulation but surprisingly offers little benefit, highlighting the difficulty of reasoning from limited views;
(2) Supervision on single abilities (generating cognitive maps or reasoning chains alone), which yields only marginal gains; and
(3) A synergistic approach that jointly trains the model to first generate a cognitive map and then reason over it, which yields substantial performance gains and is the key breakthrough.
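The sketch below illustrates one plausible form of such a map-then-reason supervision target: the model first emits a structured cognitive map, then a reasoning chain that ends in the answer. The tag names, JSON schema, and grid coordinates are illustrative assumptions, not the paper's exact annotation format.

```python
# Minimal sketch of a map-then-reason supervision target.
# The <map>/<think>/<answer> tags, map schema, and example content below are
# illustrative assumptions, not the authors' actual annotation format.

import json

def build_target(cognitive_map: dict, reasoning_steps: list[str], answer: str) -> str:
    """Serialize a training target: cognitive map first, reasoning second, answer last."""
    map_block = json.dumps(cognitive_map, indent=2)
    reasoning_block = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(reasoning_steps))
    return (
        f"<map>\n{map_block}\n</map>\n"
        f"<think>\n{reasoning_block}\n</think>\n"
        f"<answer>{answer}</answer>"
    )

# Hypothetical example: object layout on a top-down grid inferred from two views.
target = build_target(
    cognitive_map={
        "objects": {
            "sofa": {"position": [0, 2], "facing": "south"},
            "lamp": {"position": [3, 2], "facing": "west"},
        },
        "camera_views": {"view_1": {"position": [1, 0], "facing": "north"}},
    },
    reasoning_steps=[
        "From view_1, the sofa is ahead-left and the lamp is ahead-right.",
        "If the camera moved behind the sofa and faced south, left and right would flip.",
    ],
    answer="B",
)
print(target)
```

Ordering the map before the reasoning chain lets the subsequent reasoning tokens condition on an explicit spatial representation rather than on raw pixels alone.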
This mapping-then-reasoning paradigm proves highly effective:
Training models to reason over these internal maps improves accuracy from `37.8%` to `60.8%` (`+23.0%`), and adding reinforcement learning further improves it to `70.7%` (`+32.9%`).
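As one hedged illustration of how such a reinforcement-learning stage could be set up, the sketch below scores a response by the correctness of its final answer, with a small bonus for emitting a well-formed cognitive map; the tag format and reward weights are assumptions, not the paper's actual reward design.

```python
# Minimal sketch of an outcome-based reward for the RL stage.
# The reward weights and the <map>/<answer> tag format are assumptions for
# illustration; the paper's exact reward design may differ.

import json
import re

def reward(response: str, gold_answer: str) -> float:
    """Score a response: correctness of the final answer plus a small
    bonus if the emitted cognitive map parses as JSON."""
    score = 0.0

    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer_match and answer_match.group(1).strip() == gold_answer:
        score += 1.0  # main signal: the final choice is correct

    map_match = re.search(r"<map>(.*?)</map>", response, re.DOTALL)
    if map_match:
        try:
            json.loads(map_match.group(1))
            score += 0.1  # format bonus: the cognitive map is valid JSON
        except json.JSONDecodeError:
            pass

    return score
```

Such an outcome-level reward would typically be plugged into a policy-gradient fine-tuning loop after the supervised mapping-then-reasoning stage.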
Our key insight is that this scaffolding of spatial mental models, which couples internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
Submission Number: 2