Mitigating Hallucination Caused by Excessive Reliance on LLM within MLLM instead of Images

ACL ARR 2024 June Submission 5100 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In multimodal generation and comprehension, multimodal large language models (MLLMs), which couple visual encoders with large language models, have achieved significant success. However, relying solely on modal connection layers/modules to unify these components can cause the model to neglect image information, producing visual hallucinations: generated text that is independent of the image content, such as descriptions of objects not present in the image. To mitigate this issue, we introduce a fine-tuning approach, Adversarial Contrast Dual fine-tuning (ACD). ACD leverages the MLLM itself and employs the Fast Gradient Sign Method (FGSM) to generate adversarial image samples; during fine-tuning, both the original and the adversarial images are used to perform dual contrastive fine-tuning of the MLLM. Experimental results show that our method significantly reduces hallucinations without any external annotations.
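
The method described in the abstract (FGSM-generated adversarial images plus dual fine-tuning on clean and perturbed inputs) can be sketched in PyTorch. This is a minimal, hypothetical illustration, not the authors' implementation: the model interface (Hugging Face-style `pixel_values`, `input_ids`, `labels` inputs and `.loss`/`.logits` outputs), the epsilon and alpha values, and the KL-based contrastive term are all assumptions; the paper's actual objective may differ.

```python
# Hypothetical sketch of ACD-style training: FGSM adversarial images plus a
# dual (clean + adversarial) contrastive fine-tuning step for an MLLM.
# The interface and loss combination are illustrative assumptions.
import torch
import torch.nn.functional as F


def fgsm_image(mllm, images, input_ids, labels, epsilon=2 / 255):
    """Generate adversarial images with the Fast Gradient Sign Method (FGSM)."""
    images = images.clone().detach().requires_grad_(True)
    loss = mllm(pixel_values=images, input_ids=input_ids, labels=labels).loss
    loss.backward()
    # Perturb pixels in the direction that increases the captioning loss,
    # then clamp back to the valid pixel range.
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()


def acd_step(mllm, images, input_ids, labels, optimizer, epsilon=2 / 255, alpha=1.0):
    """One dual fine-tuning step; `alpha` weights the contrastive term.
    The specific objective below is one plausible reading, not the paper's."""
    adv_images = fgsm_image(mllm, images, input_ids, labels, epsilon)
    optimizer.zero_grad()

    clean_out = mllm(pixel_values=images, input_ids=input_ids, labels=labels)
    adv_out = mllm(pixel_values=adv_images, input_ids=input_ids, labels=labels)

    # Keep the usual language-modeling loss on clean images, and push the
    # model's predictions on adversarial images away from its clean-image
    # predictions, so that generation cannot simply ignore the visual input.
    # (A real objective would bound or reweight this divergence term.)
    contrast = -F.kl_div(
        F.log_softmax(adv_out.logits, dim=-1),
        F.softmax(clean_out.logits.detach(), dim=-1),
        reduction="batchmean",
    )
    loss = clean_out.loss + alpha * contrast
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the "dual" aspect is that every batch is seen twice, once clean and once adversarially perturbed, and the "contrastive" aspect is the divergence term that makes the text output sensitive to the image rather than to the language prior alone.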
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Large Language model, visual hallucination
Languages Studied: N/A
Submission Number: 5100