Selective Seeing: Context-Aware Attention Interventions for Mitigating Hallucinations in Large Vision-Language Models
Keywords: Large vision-language model, Hallucination, Intervention
TL;DR: We present Context-aware Attention Intervention (CAI), a training-free inference mechanism that embodies the idea of “selectively seeing”: reinforcing visual grounding only when and where it is needed.
Abstract: Large Vision-Language Models (LVLMs) excel at multimodal tasks but are susceptible to hallucinations, generating text inconsistent with visual inputs. Existing methods mitigate hallucinations by uniformly strengthening visual signals, which inadvertently amplifies irrelevant regions and spurious correlations. To address this, we present Context-aware Attention Intervention (CAI), a training-free inference mechanism that embodies the idea of “selectively seeing”: reinforcing visual grounding only when and where it is needed. Our method first estimates token–image similarity to locate semantically relevant regions, and then conditionally amplifies their attention only for high-entropy tokens in deeper layers, where visual grounding tends to degrade. This token-specific, uncertainty-aware design strengthens visual grounding without overwhelming the model with irrelevant signals. Extensive experiments show that CAI effectively mitigates hallucinations and achieves state-of-the-art performance across multiple benchmarks.
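A minimal sketch of the "selectively seeing" idea described in the abstract, using NumPy only. All function names, thresholds, the top-k cutoff, and the layer fraction are illustrative assumptions, not the paper's exact formulation: token–image similarity selects relevant patches, and attention to those patches is boosted only for high-entropy tokens in deeper layers.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_entropy(logits):
    # Predictive entropy of the next-token distribution (uncertainty proxy).
    p = softmax(logits)
    return -np.sum(p * np.log(p + 1e-12))

def relevance_mask(text_token_emb, image_patch_embs, top_k=8):
    # Cosine similarity between the current text token and image patches;
    # keep only the most similar (semantically relevant) regions.
    t = text_token_emb / (np.linalg.norm(text_token_emb) + 1e-12)
    v = image_patch_embs / (np.linalg.norm(image_patch_embs, axis=-1, keepdims=True) + 1e-12)
    sims = v @ t
    mask = np.zeros_like(sims)
    mask[np.argsort(sims)[-top_k:]] = 1.0
    return mask

def intervene_attention(attn_weights, mask, layer_idx, num_layers, entropy,
                        entropy_thresh=2.0, deep_layer_frac=0.5, alpha=1.5):
    # Amplify attention to relevant image patches only for uncertain
    # (high-entropy) tokens in the deeper layers, then renormalize;
    # leave attention untouched everywhere else.
    is_deep = layer_idx >= int(deep_layer_frac * num_layers)
    if not (is_deep and entropy > entropy_thresh):
        return attn_weights
    boosted = attn_weights * (1.0 + (alpha - 1.0) * mask)
    return boosted / boosted.sum()

# Toy usage with random tensors standing in for model states:
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=576))                      # attention over 576 image patches
mask = relevance_mask(rng.normal(size=1024), rng.normal(size=(576, 1024)))
new_attn = intervene_attention(attn, mask, layer_idx=28, num_layers=32,
                               entropy=token_entropy(rng.normal(size=32000)))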
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4966