Keywords: Hallucination; Large Vision-Language Models; Decoding Strategy
TL;DR: An image-attention-guided key-value (KV) merging collaborative decoding strategy to mitigate hallucinations in LVLMs
Abstract: How can we ensure that Large Vision-Language Models (LVLMs) maintain strong attention to visual input throughout the inference process? Recent LVLMs have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to "hallucinations": outputs that are not grounded in the corresponding images. Many efforts have been made to address this challenge, but each approach comes with its own limitations, such as high computational cost or expensive dataset annotation. Worse still, many of them fail to recognize the crucial role of visual attention in guiding the model's response generation.
In our research, we identify a key limitation in current LVLMs: the model's diminishing attention to visual input as the number of generated tokens increases, which results in performance degradation. To address this challenge, we propose \textbf{I}mage attention-guided \textbf{K}ey-value merging c\textbf{O}llaborative \textbf{D}ecoding (IKOD), a collaborative decoding strategy that generates image-focused sequences using key-value merging. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding process, effectively mitigating attention decay. Importantly, IKOD requires no additional training or external tools, making it highly scalable and applicable to various models.
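The abstract describes IKOD as combining logits from the original decoding pass with logits from a shorter, image-attention-focused pass obtained via key-value merging. Below is a minimal sketch of that logit-combination step; the function name `ikod_combine`, the mixing weight `alpha`, and the toy tensors are illustrative assumptions, not the paper's exact formulation or API.

```python
import torch
import torch.nn.functional as F

def ikod_combine(logits_full: torch.Tensor,
                 logits_image: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Combine next-token logits from the standard (full-sequence) decoding pass
    with logits from the shorter, image-attention-guided pass produced by
    key-value merging. `alpha` is an assumed mixing weight, not a value from the paper."""
    return (1.0 - alpha) * logits_full + alpha * logits_image

# Toy usage: two vocabulary-sized logit vectors for the next token.
vocab_size = 32000
logits_full = torch.randn(1, vocab_size)   # from the original decoding process
logits_image = torch.randn(1, vocab_size)  # from the image-focused, KV-merged sequence
next_token = F.softmax(ikod_combine(logits_full, logits_image), dim=-1).argmax(dim=-1)
```

Since the combination operates only on logits, such a step would add no training and no external tools, consistent with the scalability claim above.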
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 512