Abstract: Vision--language models (VLMs) achieve strong performance on many multimodal
benchmarks but remain brittle on spatial reasoning tasks that require aligning
abstract overhead representations with egocentric views. We introduce
\textbf{m2sv}, a scalable benchmark for map-to-street-view spatial reasoning that
asks models to infer camera viewing direction by aligning a north-up overhead map
with a Street View image captured at the same real-world intersection.
We release \textbf{m2sv-20k}, a geographically diverse benchmark with controlled
ambiguity, along with \textbf{m2sv-sft-11k}, a curated set of structured reasoning
traces for supervised fine-tuning.
Despite strong performance on existing multimodal benchmarks, the best-performing
VLM we evaluate reaches only 65.2\% accuracy on m2sv, far below the human baseline of 95\%.
While supervised fine-tuning and reinforcement learning yield consistent gains,
cross-benchmark evaluations reveal limited transfer.
Beyond aggregate accuracy, we systematically analyze \emph{difficulty} in
map-to-street-view reasoning using both structural signals and human effort, and
conduct an extensive failure analysis of adapted open models. Our findings
highlight persistent gaps in geometric alignment, evidence aggregation, and
reasoning consistency, motivating future work on grounded spatial reasoning
across viewpoints.