Keywords: Spatial Mental Model, Limited Views, Partial Observation, Spatial Reasoning, Cognitive Map
TL;DR: We propose the MindCube benchmark and find that existing VLMs perform poorly on it. Supervising models to first generate cognitive maps and then reason over them proves to be an effective approximation of spatial mental modeling from limited views.
Abstract: Humans intuitively construct mental models of space beyond what they directly perceive, but can large vision-language models (VLMs) do the same from partial observations such as **limited views**? We identify a significant gap in current VLMs via our new **MindCube** benchmark, comprising $21,154$ questions over $3,268$ images, which evaluates how well VLMs build robust spatial mental models representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation of "what-if" movements) in order to reason about **unseen** space beyond immediate perception.
We explore three approaches to approximating spatial mental models in VLMs:
(1) View interpolation, which visualizes mental simulation but surprisingly offers little benefit, highlighting the difficulty of reasoning from limited views;
(2) Supervision on single abilities (generating cognitive maps or reasoning chains alone), which yields only marginal gains; and
(3) A synergistic approach that jointly trains the model to first generate a cognitive map and then reason over it, which yields substantial performance gains and is the key breakthrough.
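The sketch below illustrates one plausible form of such a map-then-reason supervision target: the model first emits a structured cognitive map, then a reasoning chain that ends in the answer. The tag names, JSON schema, and grid coordinates are illustrative assumptions, not the paper's exact annotation format.

```python
# Minimal sketch of a map-then-reason supervision target.
# The <map>/<think>/<answer> tags, map schema, and example content below are
# illustrative assumptions, not the authors' actual annotation format.

import json

def build_target(cognitive_map: dict, reasoning_steps: list[str], answer: str) -> str:
    """Serialize a training target: cognitive map first, reasoning second, answer last."""
    map_block = json.dumps(cognitive_map, indent=2)
    reasoning_block = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(reasoning_steps))
    return (
        f"<map>\n{map_block}\n</map>\n"
        f"<think>\n{reasoning_block}\n</think>\n"
        f"<answer>{answer}</answer>"
    )

# Hypothetical example: object layout on a top-down grid inferred from two views.
target = build_target(
    cognitive_map={
        "objects": {
            "sofa": {"position": [0, 2], "facing": "south"},
            "lamp": {"position": [3, 2], "facing": "west"},
        },
        "camera_views": {"view_1": {"position": [1, 0], "facing": "north"}},
    },
    reasoning_steps=[
        "From view_1, the sofa is ahead-left and the lamp is ahead-right.",
        "If the camera moved behind the sofa and faced south, left and right would flip.",
    ],
    answer="B",
)
print(target)
```

Ordering the map before the reasoning chain lets the subsequent reasoning tokens condition on an explicit spatial representation rather than on raw pixels alone.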
This mapping-then-reasoning paradigm proves highly effective:
Training models to reason over these internal maps improves accuracy from `37.8%` to `60.8%` (`+23.0%`), and adding reinforcement learning further improves it to `70.7%` (`+32.9%`).
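As one hedged illustration of how such a reinforcement-learning stage could be set up, the sketch below scores a response by the correctness of its final answer, with a small bonus for emitting a well-formed cognitive map; the tag format and reward weights are assumptions, not the paper's actual reward design.

```python
# Minimal sketch of an outcome-based reward for the RL stage.
# The reward weights and the <map>/<answer> tag format are assumptions for
# illustration; the paper's exact reward design may differ.

import json
import re

def reward(response: str, gold_answer: str) -> float:
    """Score a response: correctness of the final answer plus a small
    bonus if the emitted cognitive map parses as JSON."""
    score = 0.0

    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if answer_match and answer_match.group(1).strip() == gold_answer:
        score += 1.0  # main signal: the final choice is correct

    map_match = re.search(r"<map>(.*?)</map>", response, re.DOTALL)
    if map_match:
        try:
            json.loads(map_match.group(1))
            score += 0.1  # format bonus: the cognitive map is valid JSON
        except json.JSONDecodeError:
            pass

    return score
```

Such an outcome-level reward would typically be plugged into a policy-gradient fine-tuning loop after the supervised mapping-then-reasoning stage.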
Our key insight is that this scaffolding of spatial mental models, which couples internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.
Submission Number: 2