Keywords: vision-language models, spatial reasoning, geospatial understanding, multimodal benchmarks, map-to-image alignment, grounded reasoning
TL;DR: A scalable benchmark shows that vision–language models degrade sharply with spatial difficulty when aligning overhead maps and street-level views, remaining far below human performance despite fine-tuning and reinforcement learning.
Abstract: Vision–language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce **m2sv**, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release **m2sv-20k**, a geographically diverse benchmark with controlled ambiguity, along with **m2sv-sft-11k**, a curated set of structured reasoning traces for supervised fine-tuning.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 68