Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · License: CC BY-NC-ND 4.0
TL;DR: We find that the attention pattern in VLMs can be adjusted adaptively to modify the focus area without requiring additional training.
Abstract: Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing “under” or “behind” relationships between only two objects, pose significant challenges for current VLMs. We believe it is crucial to use the lens of mechanistic interpretability, opening up the model and examining its internal states to study the interactions between image and text tokens during spatial reasoning. Our analysis of attention behaviors reveals significant differences in how VLMs allocate attention to image versus text tokens. By tracing the image regions that receive the highest attention scores across intermediate layers, we observe a notable pattern: errors often coincide with attention being misdirected towards irrelevant objects in the image. Moreover, these attention patterns differ substantially between familiar (e.g., “on the left side of”) and unfamiliar (e.g., “in front of”) spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen attention on highly relevant regions when the model is confident, and to smooth and broaden the attention window to consider wider context when confidence is lower. This training-free decoding method yields significant gains (e.g., up to a 50-point absolute improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible additional cost.
Lay Summary: We analyze the failure of spatial reasoning in Vision-Language Models (VLMs) through the lens of attention, assessing whether they "look" at the correct regions. Our findings reveal a strong bias: VLMs often fail to attend to the right spatial locations and show low confidence when handling unfamiliar relationships such as "in front of" or "behind", while performing better on familiar ones like "left" or "right". To address this, we propose AdaptVis, a confidence-based decoding method. When the model is confident, we sharpen the attention distribution so it focuses more precisely on relevant image regions; when it is uncertain, we smooth the distribution so the model explores a broader context and attends to other regions. Our experiments show that AdaptVis achieves gains of up to 50 absolute points on spatial reasoning benchmarks like WhatsUp for LLaVA models, demonstrating the effectiveness of our decoding method.
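At its core, this kind of confidence-gated attention adjustment amounts to a temperature-like rescaling of the attention logits over image tokens before the softmax. The sketch below is a minimal illustration of that idea only; the function name, threshold, and scaling factors (`conf_threshold`, `alpha_sharp`, `alpha_smooth`) are illustrative assumptions, not the paper's actual hyperparameters or implementation, which live in the linked repository.

```python
# Minimal sketch of confidence-gated attention scaling over image tokens.
# All names and constants here are illustrative assumptions; see the
# linked AdaptVis repository for the authors' actual implementation.
import torch
import torch.nn.functional as F

def confidence_gated_attention(attn_logits: torch.Tensor,
                               image_token_mask: torch.Tensor,
                               confidence: float,
                               conf_threshold: float = 0.5,
                               alpha_sharp: float = 1.5,
                               alpha_smooth: float = 0.7) -> torch.Tensor:
    """Rescale pre-softmax attention scores on image-token positions.

    attn_logits:      [..., seq_len] pre-softmax attention scores for one query.
    image_token_mask: [seq_len] boolean mask marking image-token positions.
    confidence:       the model's confidence in its answer (e.g., answer-token probability).
    """
    # Sharpen when the model is confident, smooth when it is uncertain.
    alpha = alpha_sharp if confidence >= conf_threshold else alpha_smooth

    scaled = attn_logits.clone()
    # alpha > 1 concentrates attention on the highest-scoring image regions;
    # alpha < 1 flattens it so a wider context is considered.
    scaled[..., image_token_mask] = scaled[..., image_token_mask] * alpha
    return F.softmax(scaled, dim=-1)

# Toy usage: 4 image tokens followed by 2 text tokens.
logits = torch.tensor([2.0, 0.5, 0.1, 0.1, 1.0, 0.3])
mask = torch.tensor([True, True, True, True, False, False])
print(confidence_gated_attention(logits, mask, confidence=0.9))  # sharper over image tokens
print(confidence_gated_attention(logits, mask, confidence=0.2))  # smoother over image tokens
```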
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/shiqichen17/AdaptVis.git
Primary Area: Deep Learning->Large Language Models
Keywords: Interpretability, Attention mechanism, VLM, Spatial Reasoning
Submission Number: 9020