Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: interpretability, visualization, vision language model
TL;DR: We find a norm mismatch between vision and text tokens and build interpretability tools to diagnose how VLMs process vision tokens.
Abstract: Vision–Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. We study why VLMs, such as LLaVA, under-utilize spatial cues despite having positional encodings and spatially rich vision-encoder features. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text token embeddings, suppressing the LLM's positional embeddings. To expose this mechanism, we develop three interpretability tools: (1) the Position Sensitivity Index, which quantifies reliance on token order; (2) the Cross-Modality Balance, which reveals attention head allocation patterns; and (3) a RoPE Sensitivity probe, which measures dependence on rotary positional embeddings. These tools uncover that vision tokens and system prompts dominate attention. We validate our mechanistic understanding through targeted interventions that predictably restore positional sensitivity. These findings reveal previously unknown failure modes in multimodal attention and demonstrate how interpretability analysis can guide principled improvements. We will release code upon publication.
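The norm imbalance described in the abstract can be checked with a simple diagnostic: compare the mean L2 norms of projected vision tokens against text token embeddings entering the language model. The sketch below is illustrative only, not the paper's released code; the tensors are synthetic stand-ins (the 576-token count mirrors LLaVA's patch grid, and the scale factors are assumptions for demonstration).

```python
import numpy as np

def norm_imbalance_ratio(vision_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Ratio of mean L2 norms: vision tokens vs. text tokens.

    A ratio >> 1 suggests vision embeddings can drown out the
    comparatively small positional signal added to each token.
    """
    v = np.linalg.norm(vision_emb, axis=-1).mean()
    t = np.linalg.norm(text_emb, axis=-1).mean()
    return float(v / t)

# Synthetic stand-ins: 576 "vision" tokens vs. 32 "text" tokens in a
# 4096-dim embedding space; the 5x scale gap is a hypothetical example.
rng = np.random.default_rng(0)
vision = rng.normal(scale=5.0, size=(576, 4096))
text = rng.normal(scale=1.0, size=(32, 4096))

print(f"vision/text norm ratio: {norm_imbalance_ratio(vision, text):.2f}")
```

On a real model the same measurement would be taken on the multimodal projector's outputs and the token-embedding lookups for the text prompt, at the point where both streams are concatenated into the LLM input.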
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 10382