Visual Tokens Are Not Equal: Alleviating Hallucination in Multimodal Large Language Models via Aligning Attention

18 Sept 2025 (modified: 24 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: MLLM, Hallucination, Attention
TL;DR: Alleviating hallucination in MLLMs via aligning attention
Abstract: Hallucination remains a significant challenge for Multimodal Large Language Models (MLLMs), hindering their reliability across various tasks. Despite extensive research from various perspectives, the underlying causes remain unclear. In this paper, we conduct empirical analyses and identify a progressive attention shift in the decoding process, where the decoder’s attention over visual tokens gradually diverges from the vision encoder’s. Based on these observations, we infer that this shift systematically reduces the model’s focus on semantically important visual tokens, leading to hallucinations. Building on this finding, we propose Align Attention with Image (AAI), a decoding-time method that explicitly aligns the decoder’s attention over visual tokens with the self-attention of the vision encoder. Specifically, AAI caches the encoder’s visual self-attention and leverages it as a reference signal to guide the decoder’s attention distribution toward that of the image. AAI is decoding-agnostic and can be seamlessly integrated with both classical and modern decoding strategies across different MLLMs. We evaluate AAI on widely used hallucination benchmarks and show that it consistently reduces hallucinations without sacrificing semantic completeness. All relevant experimental code is included in the supplementary appendix and will be released publicly.
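The abstract's core mechanism can be sketched as follows. This is a minimal, hypothetical illustration of the idea of using the vision encoder's cached self-attention as a reference signal for the decoder's attention over visual tokens; the function name `align_attention`, the interpolation rule, and the parameter `alpha` are assumptions, not the paper's actual update rule.

```python
import numpy as np

def align_attention(decoder_attn, encoder_attn, alpha=0.5):
    """Blend the decoder's attention over visual tokens with the
    vision encoder's cached self-attention (a sketch of the AAI
    idea; the exact guidance rule is in the paper's appendix).

    decoder_attn: (num_visual_tokens,) decoder attention over image tokens
    encoder_attn: (num_visual_tokens,) encoder-derived saliency over the same tokens
    alpha: interpolation weight pulling toward the encoder reference
    """
    aligned = (1 - alpha) * decoder_attn + alpha * encoder_attn
    return aligned / aligned.sum()  # renormalize to a valid distribution

# Toy example: the decoder's attention has drifted away from the
# tokens the encoder deems salient (the "progressive attention shift").
decoder_attn = np.array([0.70, 0.20, 0.10])  # drifted distribution
encoder_attn = np.array([0.10, 0.30, 0.60])  # encoder saliency reference
print(align_attention(decoder_attn, encoder_attn, alpha=0.5))
# -> [0.4  0.25 0.35]
```

Because the correction is applied to attention weights at decoding time, a rule of this shape would be compatible with any decoding strategy (greedy, sampling, contrastive), consistent with the abstract's claim that AAI is decoding-agnostic.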
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11063