Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: interpretability, visualization, vision language model
TL;DR: We find a norm mismatch between vision and text tokens and build interpretability tools to diagnose how VLMs process vision tokens.
Abstract: Vision–Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. We study why VLMs, such as LLaVA, under-utilize spatial cues despite having positional encodings and spatially rich vision-encoder features. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text token embeddings, suppressing the LLM's positional embeddings. To expose this mechanism, we develop three interpretability tools: (1) the Position Sensitivity Index, which quantifies reliance on token order; (2) the Cross-Modality Balance, which reveals attention head allocation patterns; and (3) a RoPE Sensitivity probe, which measures dependence on rotary positional embeddings. These tools uncover that vision tokens and system prompts dominate attention. We validate our mechanistic understanding through targeted interventions that predictably restore positional sensitivity. These findings reveal previously unknown failure modes in multimodal attention and demonstrate how interpretability analysis can guide principled improvements. We will release code upon publication.
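The norm imbalance described in the abstract can be checked with a simple diagnostic: compare the mean L2 norms of projected vision tokens against text token embeddings entering the language model. The sketch below is illustrative only, not the paper's released code; the tensors are synthetic stand-ins (the 576-token count mirrors LLaVA's patch grid, and the scale factors are assumptions for demonstration).

```python
import numpy as np

def norm_imbalance_ratio(vision_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Ratio of mean L2 norms: vision tokens vs. text tokens.

    A ratio >> 1 suggests vision embeddings can drown out the
    comparatively small positional signal added to each token.
    """
    v = np.linalg.norm(vision_emb, axis=-1).mean()
    t = np.linalg.norm(text_emb, axis=-1).mean()
    return float(v / t)

# Synthetic stand-ins: 576 "vision" tokens vs. 32 "text" tokens in a
# 4096-dim embedding space; the 5x scale gap is a hypothetical example.
rng = np.random.default_rng(0)
vision = rng.normal(scale=5.0, size=(576, 4096))
text = rng.normal(scale=1.0, size=(32, 4096))

print(f"vision/text norm ratio: {norm_imbalance_ratio(vision, text):.2f}")
```

On a real model the same measurement would be taken on the multimodal projector's outputs and the token-embedding lookups for the text prompt, at the point where both streams are concatenated into the LLM input.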
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 10382