See Better, Say Better: Vision-Augmented Decoding for Mitigating Hallucinations in Large Vision-Language Models

Xinyi Sun, Diandian Guo, Cong Cao, Fangfang Yuan, Dakui Wang, Yanbing Liu

Published: 2025, Last Modified: 20 Mar 2026, NLPCC (1) 2025, CC BY-SA 4.0
Abstract: Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information or visual input, a phenomenon known as hallucinations. While existing approaches often rely on extensive retraining or primarily focus on adjusting the text decoder, they struggle to address hallucinations originating from visual deficiencies. We introduce “See Better, Say Better”, a novel and training-free decoding strategy that leverages explicit visual information to mitigate hallucinations. Our approach employs a lightweight visual model to detect and localize objects within the image. This extracted object information is then used to guide the LVLM in generating richer, more visually grounded descriptions. Subsequently, we utilize both the object-level features and the generated descriptive information to perform a Vision-Augmented Decoding (VAD) process, emphasizing the enriched visual cues during output generation. Extensive experiments on the challenging POPE and MME datasets across multiple LVLMs demonstrate that our proposed VAD significantly alleviates hallucinations, leading to substantial improvements in the visual consistency and factual accuracy of the generated text.
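The abstract does not give the formula behind Vision-Augmented Decoding, so the following is only an illustrative sketch of one way a decoding step could emphasize enriched visual cues. It assumes a contrastive-style combination of two next-token logit vectors: one from the LVLM given the image alone, and one from the LVLM given the image plus the detected objects and generated description. The function name `vision_augmented_step`, the weight `alpha`, the combination rule, and the toy vocabulary and logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vision_augmented_step(base_logits, augmented_logits, alpha=1.0):
    """Combine two next-token logit vectors (hypothetical rule, not the paper's).

    base_logits:      logits conditioned on the image only.
    augmented_logits: logits conditioned on the image plus object and
                      description cues.
    alpha:            assumed strength of the emphasis; alpha=0 reduces to
                      ordinary decoding on the augmented input.
    """
    if len(base_logits) != len(augmented_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    # Amplify whatever the augmented input raised relative to the base input.
    combined = [
        (1 + alpha) * a - alpha * b
        for a, b in zip(augmented_logits, base_logits)
    ]
    return softmax(combined)


# Toy example with made-up numbers: the base model is torn between "dog"
# and "cat", while the object cues favor "dog".
vocab = ["dog", "cat", "frisbee"]
base = [2.0, 1.9, 0.1]
augmented = [2.5, 0.5, 1.0]

probs = vision_augmented_step(base, augmented, alpha=1.0)
next_token = vocab[max(range(len(vocab)), key=probs.__getitem__)]
```

In this sketch a larger `alpha` pushes the distribution further toward tokens supported by the extra visual information, while `alpha=0` leaves the augmented distribution unchanged. How the actual method weights object-level features and descriptions may differ.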