Mitigating Object Hallucination in Large Vision-Language Models via Visual Attention Direct Preference Optimization

Published: 01 Jan 2025, Last Modified: 12 Nov 2025, ICME 2025, CC BY-SA 4.0
Abstract: Large Vision-Language Models (LVLMs) suffer from severe object hallucination: they frequently generate outputs that do not correspond to the image content, which significantly reduces the credibility and reliability of their responses. Recent work has applied Direct Preference Optimization (DPO) to LVLMs to reduce hallucinations and improve response quality. However, these approaches typically rely on text-only preference response pairs, neglecting the influence of the visual input when optimizing LVLMs. In this paper, we propose Visual Attention Direct Preference Optimization (VA-DPO), a multimodal optimization objective. VA-DPO leverages the LVLM's attention to select and corrupt critical regions of the image, thereby constructing visual preference image pairs, and it integrates text and visual preference optimization objectives to achieve effective alignment of LVLMs. We conduct extensive experiments on LVLMs of different sizes, and the results demonstrate that VA-DPO effectively reduces hallucinations across various tasks and achieves more competitive results than other hallucination mitigation approaches.
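The abstract describes two ingredients: attention-guided corruption of critical image regions to build visual preference pairs, and a loss that combines text and visual preference objectives. The sketch below is one plausible reading of that recipe, not the authors' released code. The patch size, the top-k zero-masking corruption, the weighting term `lam`, and the helper scorers `logp`/`logp_ref` (assumed to return the summed token log-probability of a response given an image under the policy and a frozen reference model) are illustrative assumptions.

```python
# Minimal VA-DPO-style sketch (assumptions noted above; not the paper's implementation).
import torch
import torch.nn.functional as F

def corrupt_critical_patches(image, attn_map, patch=14, top_k=16):
    """Zero out the image patches the model attends to most, producing a
    'rejected' image that removes the visual evidence for the preferred response."""
    B, C, H, W = image.shape
    gh, gw = H // patch, W // patch
    scores = attn_map.view(B, gh * gw)            # one attention score per patch
    idx = scores.topk(top_k, dim=-1).indices      # most-attended patches per sample
    corrupted = image.clone()
    for b in range(B):
        for i in idx[b].tolist():
            r, c = divmod(i, gw)
            corrupted[b, :, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return corrupted

def va_dpo_loss(logp, logp_ref, img, img_bad, y_win, y_lose, beta=0.1, lam=1.0):
    """Combine a text-preference DPO term (chosen vs. rejected response on the
    clean image) with a visual-preference term (clean vs. attention-corrupted
    image for the chosen response)."""
    # Text preference: prefer y_win over y_lose given the clean image.
    text_margin = (logp(img, y_win) - logp_ref(img, y_win)) \
                - (logp(img, y_lose) - logp_ref(img, y_lose))
    # Visual preference: prefer the clean image over the corrupted one for y_win.
    vis_margin = (logp(img, y_win) - logp_ref(img, y_win)) \
               - (logp(img_bad, y_win) - logp_ref(img_bad, y_win))
    return -(F.logsigmoid(beta * text_margin) + lam * F.logsigmoid(beta * vis_margin)).mean()

if __name__ == "__main__":
    # Toy check with stand-in scorers; in practice these would be the LVLM policy
    # and a frozen reference model.
    torch.manual_seed(0)
    img = torch.rand(2, 3, 224, 224)
    attn = torch.rand(2, 16, 16)                  # per-patch attention over the image
    img_bad = corrupt_critical_patches(img, attn, patch=14, top_k=16)
    logp = lambda image, resp: image.mean(dim=(1, 2, 3)) + float(resp == "good")
    logp_ref = lambda image, resp: image.mean(dim=(1, 2, 3))
    print(va_dpo_loss(logp, logp_ref, img, img_bad, "good", "bad").item())
```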