See Better, Say Better: Vision-Augmented Decoding for Mitigating Hallucinations in Large Vision-Language Models
Abstract: Large Vision-Language Models (LVLMs) often produce responses that are inconsistent with factual information or with the visual input, a phenomenon known as hallucination. Existing approaches typically rely on extensive retraining or focus primarily on adjusting the text decoder, and they struggle to address hallucinations that originate from visual deficiencies. We introduce “See Better, Say Better”, a novel, training-free decoding strategy that leverages explicit visual information to mitigate hallucinations. Our approach employs a lightweight visual model to detect and localize objects within the image. The extracted object information is then used to guide the LVLM in generating richer, more visually grounded descriptions. Finally, we use both the object-level features and the generated descriptions to perform Vision-Augmented Decoding (VAD), which emphasizes the enriched visual cues during output generation. Extensive experiments on the challenging POPE and MME datasets across multiple LVLMs demonstrate that the proposed VAD significantly alleviates hallucinations, leading to substantial improvements in the visual consistency and factual accuracy of the generated text.
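The abstract does not give the VAD formula, so the following is only a minimal sketch of one plausible way to "emphasize the enriched visual cues" at a single decoding step. It assumes a contrastive-style adjustment between two forward passes of the same LVLM: a plain pass (image and question only) and a grounded pass (image, question, detected objects, and the generated description). The function names, the `alpha` weight, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def augment_logits(plain_logits, grounded_logits, alpha=1.0):
    """Shift next-token logits toward the visually grounded pass.

    ASSUMPTION: a contrastive-style rule, (1 + alpha) * grounded - alpha * plain.
    The paper's actual VAD rule may differ; alpha = 0 returns the grounded logits.
    """
    if len(plain_logits) != len(grounded_logits):
        raise ValueError("both passes must score the same vocabulary")
    return [
        (1 + alpha) * g - alpha * p
        for p, g in zip(plain_logits, grounded_logits)
    ]


def pick_token(vocab, plain_logits, grounded_logits, alpha=1.0):
    """Greedy choice of the next token under the adjusted distribution."""
    probs = softmax(augment_logits(plain_logits, grounded_logits, alpha))
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs


# Toy POPE-style question: "Is there a dog in the image?"
vocab = ["yes", "no"]
plain = [1.2, 1.0]     # hypothetical logits from image + question only
grounded = [0.8, 1.5]  # hypothetical logits with detected objects + description
token, probs = pick_token(vocab, plain, grounded, alpha=1.0)
print(token)
```

With these toy numbers the plain pass alone would prefer "yes", while the adjusted distribution prefers "no", which illustrates how extra visual evidence could override a language-prior guess. In a real system the two logit vectors would come from the LVLM itself and `alpha` would need tuning.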
External IDs: dblp:conf/nlpcc/SunGCYWL25