Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models
Track: tiny paper (up to 4 pages)
Keywords: Vision-Language Models, Multimodal Foundation Models, Hallucination, Interpretability, Attention Mechanism, Reliability, Self-Consistency
TL;DR: We demonstrate that spatial attention in Vision-Language Models is uncorrelated with accuracy; instead, true reliability is encoded in late-layer hidden states and generation consistency.
Abstract: Multimodal Foundation Models (MFMs) are rapidly evolving from simple pattern matchers into reasoning agents. As they do, the challenge of reliability, i.e., knowing when a model is hallucinating, becomes critical. A common intuition in the field, which we refer to as the Attention-Confidence Assumption, holds that model reliability stems from "structural" visual perception: if a model focuses tightly on relevant image regions, its subsequent answer should be trustworthy, whereas scattered attention is assumed to signal confusion. We challenge this assumption with the VLM Reliability Probe (VRP), a systematic cross-family investigation of reliability signals in contemporary Vision-Language Models (VLMs). We introduce "structural attention" metrics, including cluster counts $C_{k}$ and spatial entropy $H_{s}$, to quantify the coherence of the visual encoder's gaze. To capture the dynamics of this gaze, we further track attention evolution $\Delta H_{s}$ across all layers. This analysis reveals a critical "Symbolic Detachment": models often exhibit "Early Locking" of visual features only to diffuse attention in later layers, effectively severing the link between early perception and final generation. Contrary to the grounding hypothesis, our results demonstrate a "Cluster Failure": spatial attention patterns have near-zero correlation ($R\approx0.001$) with model accuracy. Instead, we find that reliability is fundamentally a phenomenon of generation dynamics. Self-Consistency (SC), the agreement rate across sampled reasoning paths, emerges as the dominant predictor of truth ($R=0.429$); when model agreement is perfect, precision exceeds 90%. These findings suggest that for current VLM families, reliability signals are detached from visual grounding maps and are best retrieved via next-token prediction artifacts.
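The abstract names three quantities: spatial entropy $H_{s}$, cluster counts $C_{k}$, and Self-Consistency (SC) as an agreement rate over sampled answers. The sketch below is an illustrative, hedged interpretation of how such metrics could be computed; the paper does not provide these definitions or function names, so the thresholding rule for $C_{k}$ and the helper functions are assumptions for illustration only.

```python
import numpy as np
from collections import Counter
from scipy.ndimage import label

def spatial_entropy(attn_map: np.ndarray) -> float:
    """Shannon entropy of a 2D attention map, normalized to a probability
    distribution; higher values indicate more diffuse (scattered) attention."""
    p = attn_map.ravel().astype(float)
    p = p / p.sum()
    p = p[p > 0]  # drop zero cells so log is defined
    return float(-(p * np.log(p)).sum())

def cluster_count(attn_map: np.ndarray, frac: float = 0.5) -> int:
    """One plausible reading of C_k: number of connected high-attention
    regions after thresholding at a fraction of the map's maximum."""
    mask = attn_map >= frac * attn_map.max()
    _, n_clusters = label(mask)
    return n_clusters

def self_consistency(answers: list[str]) -> float:
    """Agreement rate: fraction of sampled answers matching the majority answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Toy usage: a tightly peaked map is low-entropy and single-cluster, while
# high agreement across samples would flag the answer as more reliable.
attn = np.zeros((24, 24))
attn[10:14, 10:14] = 1.0
print(spatial_entropy(attn), cluster_count(attn))
print(self_consistency(["cat", "cat", "cat", "dog", "cat"]))  # 0.8
```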
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 49