Vocabulary Fixation Reveals Visual Attention Sink for Hallucination Mitigation in LVLMs

08 Sept 2025 (modified: 19 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Vision-Language Models, Visual Attention Sink, Hallucination
TL;DR: We discover that visual attention sink tokens in LVLMs exhibit a predictable "Vocabulary Fixation" behavior, enabling us to propose SAVAE, a training-free method that significantly reduces hallucination at no extra computational cost.
Abstract: Large Vision-Language Models (LVLMs) have made remarkable progress on multimodal tasks, but their reliability is undermined by hallucination: the tendency to generate text that contradicts the visual input. Recent work has established a strong link between hallucination and the model's attention to visual tokens. However, the current understanding of the Visual Attention Sink (VAS) phenomenon---where LVLMs persistently assign high attention to uninformative background tokens---remains superficial, leaving both its underlying mechanism and its connection to hallucination unexplored. In this work, we present the first in-depth analysis of VAS. Using the logit lens, we uncover a key property we term **Vocabulary Fixation**: VAS tokens consistently map to a small, fixed set of semantically vacuous words across all layers. Based on this observation, we propose **Vocabulary Fixation-Based Identification (VFI)** to reliably localize visual sink tokens in LVLMs. Furthermore, we establish a strong correlation between VAS and hallucination, and introduce the *Non-Sink Visual Attention Ratio (NVAR)*, a novel metric that precisely identifies attention heads critical for mitigating hallucination. Building on this foundation, we propose **Sink-Aware Visual Attention Enhancement (SAVAE)**, a training-free method that adaptively strengthens the attention of these targeted heads to salient visual content during inference. Extensive experiments across multiple LVLMs and benchmarks demonstrate that SAVAE significantly outperforms existing decoding strategies in mitigating hallucination, while introducing no additional computational overhead.
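To make the abstract's pipeline concrete, the sketch below illustrates the three ideas it names (logit-lens-based VFI, the NVAR metric, and SAVAE-style attention boosting) on dummy tensors. This is not the authors' implementation: the tensor shapes, the vacuous-word set `vacuous_ids`, and the hyperparameters `alpha` and `tau` are hypothetical placeholders chosen only for illustration.

```python
# Illustrative sketch only -- dummy tensors stand in for a real LVLM.
# All shapes, the vacuous-word set, and the thresholds are assumptions.
import torch

torch.manual_seed(0)
n_layers, n_heads, n_visual, d_model, vocab = 4, 8, 16, 32, 100

hidden = torch.randn(n_layers, n_visual, d_model)  # per-layer hidden states of the visual tokens
unembed = torch.randn(d_model, vocab)              # unembedding / LM-head matrix used by the logit lens
attn = torch.rand(n_heads, n_visual).softmax(-1)   # one query's attention over visual tokens, per head
vacuous_ids = {0, 1, 2}                            # hypothetical small fixed set of "vacuous" word ids

# Make the first three visual tokens behave like sinks for the demo:
# push their hidden states toward the unembedding direction of word id 0.
hidden[:, :3] = 10.0 * unembed[:, 0] + 0.1 * torch.randn(n_layers, 3, d_model)

# --- Vocabulary Fixation-Based Identification (VFI), sketched ---
# A visual token is flagged as a sink if its logit-lens top-1 word lies in
# the fixed vacuous-word set at every layer.
top1 = (hidden @ unembed).argmax(-1)               # (n_layers, n_visual) top-1 vocab id per layer
is_sink = torch.tensor([
    all(int(top1[l, v]) in vacuous_ids for l in range(n_layers))
    for v in range(n_visual)
])

# --- Non-Sink Visual Attention Ratio (NVAR), sketched ---
# Per head: fraction of visual attention mass that falls on non-sink tokens.
nvar = attn[:, ~is_sink].sum(-1) / attn.sum(-1).clamp_min(1e-8)

# --- SAVAE-style enhancement, sketched ---
# For heads whose NVAR is low (attention drawn toward sinks), upweight their
# attention to non-sink visual tokens by alpha and renormalize.
alpha, tau = 1.5, 0.85
low_heads = nvar < tau                             # (n_heads,) heads to adjust
boost_mask = low_heads.unsqueeze(-1) & (~is_sink).unsqueeze(0)
boosted = attn.clone()
boosted[boost_mask] *= alpha
boosted = boosted / boosted.sum(-1, keepdim=True)

print("sink tokens:", is_sink.nonzero().flatten().tolist())
print("NVAR per head:", [round(x, 3) for x in nvar.tolist()])
```

In an actual LVLM, a rescaling of this kind would be applied inside the attention computation of only the identified heads at decoding time, which is consistent with the abstract's description of a training-free method with no additional computational overhead.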
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2894