Visual Tokens Are Not Equal: Alleviating Hallucination in Multimodal Large Language Models via Aligning Attention

18 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: MLLM, Hallucination, Attention
TL;DR: Alleviating hallucination in MLLMs via aligning attention
Abstract: Hallucination remains a significant challenge for Multimodal Large Language Models (MLLMs), hindering their reliability across various tasks. Despite extensive research from various perspectives, the underlying causes remain unclear. In this paper, we conduct empirical analyses and identify a progressive attention shift in the decoding process, where the decoder’s attention over visual tokens gradually diverges from the vision encoder’s. Based on these observations, we infer that this shift systematically reduces the model’s focus on semantically important visual tokens, leading to hallucinations. Building on this finding, we propose Align Attention with Image (AAI), a decoding-time method that explicitly aligns the decoder’s attention over visual tokens with the self-attention of the vision encoder. Specifically, AAI caches the encoder’s visual self-attention and leverages it as a reference signal to guide the decoder’s attention distribution toward that of the image. AAI is decoding-agnostic and can be seamlessly integrated with both classical and modern decoding strategies across different MLLMs. We evaluate AAI on widely used hallucination benchmarks and show that it consistently reduces hallucinations without sacrificing semantic completeness. All relevant experimental code is included in the supplementary appendix and will be released publicly.
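The abstract's core mechanism can be sketched as follows. This is a minimal, hypothetical illustration of the idea of using the vision encoder's cached self-attention as a reference signal for the decoder's attention over visual tokens; the function name `align_attention`, the interpolation rule, and the parameter `alpha` are assumptions, not the paper's actual update rule.

```python
import numpy as np

def align_attention(decoder_attn, encoder_attn, alpha=0.5):
    """Blend the decoder's attention over visual tokens with the
    vision encoder's cached self-attention (a sketch of the AAI
    idea; the exact guidance rule is in the paper's appendix).

    decoder_attn: (num_visual_tokens,) decoder attention over image tokens
    encoder_attn: (num_visual_tokens,) encoder-derived saliency over the same tokens
    alpha: interpolation weight pulling toward the encoder reference
    """
    aligned = (1 - alpha) * decoder_attn + alpha * encoder_attn
    return aligned / aligned.sum()  # renormalize to a valid distribution

# Toy example: the decoder's attention has drifted away from the
# tokens the encoder deems salient (the "progressive attention shift").
decoder_attn = np.array([0.70, 0.20, 0.10])  # drifted distribution
encoder_attn = np.array([0.10, 0.30, 0.60])  # encoder saliency reference
print(align_attention(decoder_attn, encoder_attn, alpha=0.5))
# -> [0.4  0.25 0.35]
```

Because the correction is applied to attention weights at decoding time, a rule of this shape would be compatible with any decoding strategy (greedy, sampling, contrastive), consistent with the abstract's claim that AAI is decoding-agnostic.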
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11063