Keywords: Large Language Model, Vision-Language Model, Spatial Reasoning, Spatial Agent, Active Exploration
Abstract: Spatial embodied intelligence under partial observability requires agents to actively acquire missing information rather than passively consume complete observations. While multimodal foundation models excel at passive perception and reasoning, their ability to support self-directed exploration for building and maintaining coherent spatial beliefs remains understudied. We propose Theory of Space, defined as an agent’s ability to construct, revise, and exploit a spatial belief through active exploration under partial observability. We implement Theory of Space as a benchmark in textual and visual environments, where the goal is curiosity-driven exploration to build a complete and accurate spatial belief. A key innovation is spatial belief probing, which prompts agents to externalize their internal spatial belief as a cognitive map at each step, enabling direct measurement of belief quality. Evaluating state-of-the-art models on downstream tasks reveals three bottlenecks: (1) the \textbf{Active-Passive Gap}, where performance drops when agents must autonomously gather information (e.g., \textsc{GPT-5.2}: $57.1{\to}46.0$); (2) \textbf{Inefficiency}, with redundant and unsystematic exploration; and (3) \textbf{Unstable Global Beliefs}, where spatial knowledge degrades over time. A false-belief paradigm further reveals \textbf{Belief Inertia}, especially severe in vision-based models.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 20