Keywords: vision, language, reasoning
Abstract: Spatial reasoning is a critical capability for Vision–Language Models (VLMs), particularly when deployed as Vision–Language–Action (VLA) agents in real-world environments. However, existing benchmarks predominantly focus on simple, single-hop spatial questions, falling short of capturing the multi-hop reasoning and precise visual grounding required in practical scenarios.
To address this gap, we introduce MultihopSpatial, a benchmark designed for multi-hop compositional spatial reasoning, with questions spanning one to three reasoning hops across egocentric and exocentric perspectives. Through extensive evaluation of 30 state-of-the-art VLMs, we demonstrate that compositional spatial reasoning remains a significant challenge for current VLMs.
Previously Accepted: No
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 16