Keywords: spatial reasoning, vision-language models, egocentric-allocentric alignment, multiview geometric reasoning, cognitive-inspired evaluation, dual representation
TL;DR: REMAP is a benchmark for evaluating multiview spatial reasoning, requiring alignment between symbolic maps and first-person views under viewpoint change.
Abstract: Building coherent world models requires agents to align local perceptual experience with global, abstract representations of space.
This paper introduces REMAP, a controlled benchmark for evaluating multiview spatial reasoning in vision-language models (VLMs).
Motivated by developmental research showing that humans flexibly align egocentric and allocentric representations across changes in viewpoint and orientation, REMAP requires agents to identify a target location by aligning an allocentric map with multiple egocentric observations.
Critically, this setup tests cross-view geometric correspondence rather than view-specific visual matching.
REMAP instantiates synthetic triangle environments with systematically varied angle configurations, enabling fine-grained analysis of sensitivity to different geometric relations and representations.
Evaluating 17 VLMs alongside human performance, we find that leading models outperform random baselines but remain substantially below average human accuracy.
Beyond this performance gap, models exhibit systematic, representation-specific failures: they show persistent weaknesses on side-based representations even when geometric cues are highly distinctive.
These findings reveal a substantial gap between human and model spatial reasoning, suggesting that current VLMs lack the cross-view geometric abstractions needed for coherent world model construction and struggle to robustly integrate partial observations.
Submission Number: 89