Keywords: Spatial Mental Model, Spatial Reasoning, Cognitive Map, Internal World Model, Large Vision Language Model
Abstract: People intuitively construct mental models of space beyond what they directly perceive, but can large vision-language models (VLMs) do the same from partial observations such as limited views? We identify a significant gap for current VLMs with our new MINDCUBE benchmark, which comprises 17,530 questions and 2,919 images. MINDCUBE evaluates how well VLMs build robust spatial mental models that represent positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of what-if movements) in order to reason about unseen space that lies beyond immediate perception.
We explore three approaches to approximating spatial mental models in VLMs: (1) view interpolation to visualize mental simulation, which surprisingly offers little benefit, highlighting the challenge of reasoning from limited views; (2) textual reasoning chains, which effectively guide model thinking when supervised; and (3) structured representations such as cognitive maps, where ground-truth maps help little, but training VLMs to generate and reason over their own maps yields substantial gains, even if the maps are imperfect. Training models to reason over these internal maps raises accuracy from 38.3% to 61.7% (+23.5%). Adding reinforcement learning further improves performance to 76.1% (+37.8%).
Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing spatial mental representations with flexible reasoning chains or processes, significantly improves understanding of unobservable space.
Supplementary Material: zip
Submission Type: Benchmark Paper (4-9 Pages)
NeurIPS Resubmit Bundle: pdf
NeurIPS Resubmit Summary: We conducted extensive analyses and ablation experiments. Our core contribution is the insight gained through a carefully designed benchmark, which revealed that current VLMs struggle to reason about unseen space from limited viewpoints. We therefore explored whether models could construct spatial mental models that function as an Internal World Model, and we thoroughly investigated several scaffolding methods, ranging from view interpolation and free-form reasoning to structured cognitive maps.
NeurIPS Resubmit Attestation: I am an author of the referenced NeurIPS 2025 submission. I have the right to share the anonymous reviews/meta-review for the exclusive use of the workshop PCs/reviewers. I understand they will not be redistributed publicly.
Submission Number: 156