Mitigating Large Vision Language Model Hallucinations via Entity-centric Multimodal Preference Optimization
Abstract: Large Vision Language Models (LVLMs) have demonstrated impressive capabilities across a wide range of tasks.
However, their trustworthiness is often challenged by hallucinations.
We attribute this issue to modality misalignment and the inherent hallucinations of Large Language Models (LLMs), which serve as the "brain" of LVLMs.
Multimodal human preference alignment is a widely used approach for mitigating LVLM hallucinations. However, existing methods focus on response-level alignment while neglecting alignment at the image and instruction levels, leading to modality misalignment.
To address this, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves better modality alignment than existing human preference alignment methods.
In addition, to overcome the scarcity of high-quality multimodal preference data and help LVLMs mitigate hallucinations, we introduce a fine-grained multimodal preference data construction process that labels preferences at the entity level without requiring manual annotation.
Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, which reduces hallucination rates by 80.4% on Object HalBench and 52.6% on MM HalBench, thereby enhancing the trustworthiness of LVLMs.
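To make the abstract's notion of multi-level alignment concrete, the minimal sketch below shows one plausible way a DPO-style preference objective could be extended from response-level pairs to image- and instruction-level pairs. This is not the authors' implementation of EMPO: the functions `dpo_term` and `multimodal_preference_loss`, the three-level decomposition, and the per-level weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (Rafailov et al., 2023)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward)

def multimodal_preference_loss(pairs, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined objective: one DPO-style term per alignment level
    (response, image, instruction). `pairs` maps each level to the four
    log-probabilities required by `dpo_term`; the weights are illustrative."""
    loss = torch.zeros(())
    for w, level in zip(weights, ("response", "image", "instruction")):
        loss = loss + w * dpo_term(*pairs[level])
    return loss

# Illustrative usage with dummy log-probabilities.
pairs = {
    level: (torch.tensor(-5.0), torch.tensor(-7.0),   # policy: chosen, rejected
            torch.tensor(-6.0), torch.tensor(-6.5))   # reference: chosen, rejected
    for level in ("response", "image", "instruction")
}
print(multimodal_preference_loss(pairs))
```

In this sketch, image- and instruction-level pairs are treated the same way as response-level pairs, i.e. as preferred versus dispreferred conditioning inputs; how EMPO actually constructs and weights such pairs is described in the paper itself, not here.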
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8028