CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models
Keywords: Large Vision-Language Models, Hallucination
TL;DR: We propose Caption-sensitive Attention Intervention (CAI), a training-free method that refines the outputs of caption-sensitive attention heads during inference to enhance fine-grained visual perception and mitigate object hallucination.
Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from the visual input, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotation and training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern elicited by caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and to accurately estimate optimized intervention directions for their outputs. This intervention strengthens LVLMs' fine-grained visual perception, thereby effectively mitigating object hallucination. CAI reduces object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks covering both discriminative and generative tasks, achieving state-of-the-art (SOTA) performance while incurring little additional inference cost and preserving other foundational capabilities.
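The sketch below is not the authors' implementation; it is a minimal illustration of the kind of inference-time attention-head intervention the abstract describes, assuming a HuggingFace-style decoder whose self-attention modules can be hooked. The layer/head indices, intervention directions, and scaling factor are hypothetical placeholders; CAI estimates the caption-sensitive heads and their directions via probing on caption vs. non-caption queries.

```python
# Hypothetical sketch of adding a steering direction to selected attention
# heads' outputs at inference time (in the spirit of CAI, not its actual code).

import torch


def make_head_intervention_hook(head_idx, direction, alpha, num_heads):
    """Add a scaled unit direction to one head's slice of the attention output."""
    direction = direction / direction.norm()  # unit-norm intervention direction

    def hook(module, inputs, output):
        # Many decoder attention modules return a tuple whose first element is
        # the projected attention output of shape (batch, seq_len, hidden_dim).
        attn_out = output[0] if isinstance(output, tuple) else output
        head_dim = attn_out.shape[-1] // num_heads
        start, end = head_idx * head_dim, (head_idx + 1) * head_dim
        shift = alpha * direction.to(attn_out.dtype).to(attn_out.device)
        attn_out[..., start:end] += shift
        return (attn_out, *output[1:]) if isinstance(output, tuple) else attn_out

    return hook


def register_intervention_hooks(attn_layers, num_heads, interventions, alpha=1.0):
    """
    attn_layers: list of per-layer self-attention modules (model layout assumed).
    interventions: dict mapping (layer_idx, head_idx) -> direction tensor of
        shape (head_dim,), e.g. estimated offline from probed caption-sensitive
        heads (assumed here, not derived in this sketch).
    Returns hook handles; call .remove() on each to restore the original model.
    """
    handles = []
    for (layer_idx, head_idx), direction in interventions.items():
        hook = make_head_intervention_hook(head_idx, direction, alpha, num_heads)
        handles.append(attn_layers[layer_idx].register_forward_hook(hook))
    return handles
```

Because the intervention is applied through forward hooks with a fixed set of precomputed directions, it adds only a vector addition per selected head at inference time, which is consistent with the abstract's claim of little additional inference cost.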
Primary Area: foundation or frontier models, including LLMs
Submission Number: 66