Imitating the Truth: Attention-aware Truth-Guided Enhancement for Hallucination Mitigation in Large Vision-Language Models
Keywords: Hallucination Mitigation, Large Vision-Language Models, Attention Intervention
TL;DR: We propose AGE, a training-free framework that mitigates hallucinations in LVLMs by imitating truth-grounded attention patterns of real tokens, yielding more accurate and trustworthy multimodal reasoning.
Abstract: Large Vision-Language Models (LVLMs) achieve impressive multimodal reasoning but remain prone to hallucinations, generating content inconsistent with visual evidence.
Existing mitigation methods often rely on auxiliary modules or coarse decoding-time adjustments, overlooking the fine-grained dynamics that distinguish truthful (real) tokens from hallucinatory ones.
In this paper, we introduce AGE (Attention-aware Truth-Guided Enhancement), a training-free framework that performs fine-grained, layer-wise interventions guided by the attention patterns of real tokens.
Our analysis reveals that real and hallucinated tokens follow distinct stage-specific attention behaviors, and hallucinations emerge when models fail to reproduce these behaviors.
AGE addresses this by introducing two lightweight interventions: (i) imitating the image attention of real tokens, as derived from discrepancies between real and hallucinated tokens, and (ii) imitating their text attention when semantic grounding is required.
Extensive experiments on widely used benchmarks, including COCO Image Captioning, POPE, and MME, demonstrate that AGE consistently mitigates hallucinations across diverse LVLMs such as LLaVA, MiniGPT-4, and mPLUG-Owl2, without additional training or loss of fluency.
Our results highlight that imitating truth-grounded attention dynamics is a simple yet powerful principle to improve the reliability of LVLMs.
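The abstract describes AGE only at a high level, so as a rough illustration of what a decode-time, attention-level intervention can look like, the sketch below adds a constant bias toward image-token key positions before the softmax. This is not the paper's method; the function name boost_image_attention, the image_mask convention, and the boost value are assumptions made for the example.

```python
# Illustrative sketch only: a generic attention intervention of the kind the
# abstract alludes to, NOT the AGE algorithm itself.
import torch

def boost_image_attention(attn_scores: torch.Tensor,
                          image_mask: torch.Tensor,
                          boost: float = 0.5) -> torch.Tensor:
    """Add a constant bias to pre-softmax attention scores at image-token
    key positions, nudging generation to attend more to visual evidence.

    attn_scores: (batch, heads, query_len, key_len) raw attention logits.
    image_mask:  (key_len,) boolean mask, True at image-token positions.
    """
    bias = image_mask.to(attn_scores.dtype) * boost
    return attn_scores + bias  # broadcasts over batch, heads, and queries

# Toy usage: 1 sequence, 2 heads, 1 query step, 6 key positions (first 4 are image tokens).
scores = torch.randn(1, 2, 1, 6)
mask = torch.tensor([True, True, True, True, False, False])
probs = torch.softmax(boost_image_attention(scores, mask), dim=-1)
```

A real system would apply such a bias only at selected layers and decoding steps; which layers, which tokens, and how strongly are exactly the design choices the paper's analysis of real-token attention is meant to settle.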
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16682