The Halo Effect and Language Takeover: Spatiotemporal Attention Decay Explains Vision-Language Model Failures in Simple Visual Counting
Keywords: LLM, VLM, Multimodal model, Interpretability
Abstract: Despite their remarkable capabilities in complex multimodal reasoning, Vision Language Models (VLMs) exhibit a perplexing inability to reliably perform elementary visual counting tasks. Existing hypotheses, which often center on input resolution or patch tokenization, fail to fully explain the stochastic nature of these errors, particularly in multi-digit generation. In this work, we investigate the internal decision-making dynamics of VLMs (e.g., Qwen3-VL, Gemma3) through the lens of their attention mechanisms. Leveraging a controlled synthetic dataset and introducing new metrics for Visual \textit{Sparsity} and \textit{Entropy}, we uncover a novel phenomenon: \textbf{Spatiotemporal Attention Decay}. Our analysis reveals two distinct failure modes. Spatially, models exhibit a \textbf{Halo Effect}: attention concentrates on the peripheral convex hull of object clusters rather than penetrating the geometric centers of individual instances. Temporally, we observe \textbf{Language Takeover}: during auto-regressive decoding, visual grounding decays rapidly after the initial token. Quantitative analysis confirms that as attention sparsity drops and entropy rises, the generation of subsequent digits degenerates from visual perception into hallucination driven by language priors. These findings suggest that counting failures stem from the model's inability to maintain spatiotemporal focus, and they highlight the need for mechanisms that enforce persistent visual grounding.
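To make the abstract's two metrics concrete, the sketch below computes entropy and a sparsity score over a token's attention distribution across visual patches, per decoding step. The exact formulations in the paper are not given here, so both definitions (Shannon entropy in nats; sparsity as the complement of the fraction of patches needed to cover 90% of attention mass) are assumptions, and the per-step distributions are synthetic stand-ins for real attention maps.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy (nats) of an attention distribution over visual tokens.

    Higher entropy = more diffuse attention, i.e., weaker visual grounding.
    """
    p = attn / attn.sum()
    p = p[p > 0]  # ignore zero-weight patches; 0 * log(0) -> 0
    return float(-(p * np.log(p)).sum())

def attention_sparsity(attn: np.ndarray, mass: float = 0.9) -> float:
    """Hypothetical sparsity score: 1 - (fraction of patches covering `mass`).

    Sort weights descending and count the smallest prefix whose cumulative
    sum reaches `mass`. A score near 1 means attention is concentrated on
    only a few patches; near 0 means it is spread across almost all of them.
    """
    p = np.sort(attn / attn.sum())[::-1]
    k = int(np.searchsorted(np.cumsum(p), mass)) + 1
    return 1.0 - k / p.size

# Simulated attention of three successively generated digit tokens over 256
# visual patches, sharpening -> flattening. In a real analysis these rows
# would be taken from the model's attention maps (e.g., via
# output_attentions=True in HuggingFace Transformers), averaged over heads
# and restricted to image-token positions.
rng = np.random.default_rng(0)
steps = [rng.dirichlet(np.full(256, a)) for a in (0.05, 0.5, 5.0)]

for t, attn in enumerate(steps):
    print(f"digit {t}: sparsity={attention_sparsity(attn):.3f}, "
          f"entropy={attention_entropy(attn):.3f}")
```

Under these assumed definitions, later decoding steps show falling sparsity and rising entropy, the quantitative signature the abstract associates with Language Takeover.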
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6081