Keywords: vision, language, reasoning
Abstract: Spatial reasoning is a critical capability for Vision–Language Models (VLMs), particularly when deployed as Vision–Language–Action (VLA) agents in real-world environments. However, existing benchmarks predominantly focus on simple, single-hop spatial questions, falling short of capturing the multi-hop reasoning and precise visual grounding required in practical scenarios.
To address this gap, we introduce MultihopSpatial, a benchmark designed for multi-hop compositional spatial reasoning, with questions spanning one to three reasoning hops across egocentric and exocentric perspectives. Through extensive evaluation of 30 state-of-the-art VLMs, we demonstrate that compositional spatial reasoning remains a significant challenge for current VLMs.
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 16