Spatial Mental Modeling from Limited Views

Baiqiao Yin; Qineng Wang; Pingyue Zhang; Jianshu Zhang; Kangrui Wang; Zihan Wang; Jieyu Zhang; Keshigeyan Chandrasegaran; Han Liu; Ranjay Krishna; Saining Xie; Manling Li; Jiajun Wu; Li Fei-Fei

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 | LAW | Everyone | CC BY 4.0

Keywords: Spatial Mental Model, Spatial Reasoning, Cognitive Map, Internal World Model, Large Vision Language Model

Abstract: People intuitively construct mental models of space beyond what they directly perceive, but can large vision-language models (VLMs) do the same from partial observations such as limited views? We identify a significant gap for current VLMs with our new MINDCUBE benchmark of 17,530 questions and 2,919 images. The benchmark evaluates how well VLMs build robust spatial mental models that represent positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of "what-if" movements) in order to reason about unseen space that lies beyond immediate perception. We explore three approaches to approximating spatial mental models in VLMs: (1) view interpolation to visualize mental simulation, which surprisingly offers little benefit, highlighting the challenge of reasoning from limited views; (2) textual reasoning chains, which effectively guide model thinking when supervised; and (3) structured representations such as cognitive maps, where ground-truth maps help little, but training VLMs to generate and reason over their own maps yields substantial gains, even if the maps are imperfect. Training models to reason over these internal maps raises accuracy from 38.3% to 61.7% (+23.5%). Adding reinforcement learning further improves performance to 76.1% (+37.8%). Our key insight is that such scaffolding of spatial mental models, that is, actively constructing and using spatial mental representations with flexible reasoning chains or processes, significantly improves understanding of unobservable space.

Supplementary Material: zip

Submission Type: Benchmark Paper (4-9 Pages)

NeurIPS Resubmit Bundle: pdf

NeurIPS Resubmit Summary: We have conducted extensive analyses and ablation experiments. Our core contribution is the insight gained through the design of the benchmark, which revealed that current VLMs struggle to reason about unseen space from limited viewpoints. We therefore explored whether models can construct spatial mental models that function as an internal world model, and we investigated several scaffolding methods: view interpolation, free-form reasoning, and structured cognitive maps.

NeurIPS Resubmit Attestation: I am an author of the referenced NeurIPS 2025 submission. I have the right to share the anonymous reviews/meta-review for the exclusive use of the workshop PCs/reviewers. I understand they will not be redistributed publicly.

Submission Number: 156
