Mitigating Object Hallucination in Large Vision-Language Models through Adversarial Contrastive Finetuning
Keywords: Object Hallucination, Large Vision-Language Models, Adversarial Examples, Contrastive Learning
Abstract: In recent years, large vision-language models (LVLMs) have made remarkable progress across a variety of vision-language tasks. However, they remain prone to object hallucination, such as generating descriptions of objects that do not exist in the image. To explore the internal mechanism of object hallucination, we collect normal and hallucinated image-text pairs and perform a quantitative analysis based on cosine similarity and a qualitative analysis based on smooth Grad-CAM. We find that hallucinations in LVLMs can arise from incorrect extraction of image features and from mismatches between image and text features. Inspired by these findings, we propose an adversarial contrastive fine-tuning (ACFT) method designed to strengthen the alignment between visual and textual embeddings and to encourage the visual modality to attend to the correct image features, thereby mitigating object hallucination. The key idea is to automatically generate paired positive and negative examples with an adversarial hallucination attribute flipping (AHAF) method and then contrastively fine-tune the LVLM on these pairs. Extensive experiments show that ACFT achieves state-of-the-art performance, outperforming existing approaches such as VCD, OPERA, and VTI on multiple benchmarks, including POPE and MME.
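To make the contrastive fine-tuning step concrete, here is a minimal sketch of how such an objective could be set up, given embeddings of an image, its original (positive) caption, and an AHAF-style attribute-flipped (negative) caption: pull the positive pair together and push the negative pair apart. This is not the authors' released code; the function name `acft_contrastive_loss`, the InfoNCE-style form, and the temperature value are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def acft_contrastive_loss(image_emb, pos_text_emb, neg_text_emb, temperature=0.07):
    """Hypothetical InfoNCE-style loss for adversarial contrastive fine-tuning.

    image_emb:    (B, D) pooled visual features from the LVLM's vision tower
    pos_text_emb: (B, D) embeddings of the original (non-hallucinated) captions
    neg_text_emb: (B, D) embeddings of AHAF attribute-flipped (hallucinated) captions
    """
    # The paper's analysis uses cosine similarity; reuse it as the matching score.
    image_emb = F.normalize(image_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    pos_sim = (image_emb * pos_text_emb).sum(dim=-1) / temperature  # (B,)
    neg_sim = (image_emb * neg_text_emb).sum(dim=-1) / temperature  # (B,)

    # Each image should score its true caption above its attribute-flipped caption.
    logits = torch.stack([pos_sim, neg_sim], dim=1)  # (B, 2)
    targets = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In practice this term would presumably be combined with the standard language-modeling loss during fine-tuning; the abstract does not specify the weighting or where in the LVLM the pooled embeddings are taken from.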
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7735