Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Published: 09 May 2026, Last Modified: 09 May 2026 · MUSI · CC BY 4.0
Keywords: Spatial understanding, multimodal reasoning
TL;DR: Adding imaginative perception tokens to unified models improves spatial reasoning in multimodal language models.
Abstract: Vision-language models (VLMs) excel at many tasks, yet continue to struggle with spatial reasoning, i.e., tasks that require information not directly observable in the input. Many spatial questions require simulating unseen viewpoints or integrating multiple partial observations into a unified spatial map, capabilities that humans naturally support through imagination. Prior work introduces intermediate visual representations (e.g., visual thoughts, depth, or box tokens), but these largely refine structures already visible rather than predicting the missing spatial structure implied by the input. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under an alternative spatial configuration while remaining consistent with the observed evidence. To study this capability, we formulate three tasks that require imaginative perception: Perspective Taking, Path Tracing, and Multiview Counting. For each task, we construct a dataset of roughly 20K examples spanning real-world and simulated environments, paired with ground-truth intermediate imaginations and final answers, along with curated evaluation benchmarks. Using a unified MLM backbone as a baseline, we demonstrate that supervising imaginative intermediates provides a principled way to improve spatial reasoning over unobserved structure and enables more faithful, interpretable spatial inference.
Supplementary Material: pdf
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 22