Mitigating Large Vision Language Model Hallucinations via Entity-centric Multimodal Preference Optimization
Abstract: Large Vision Language Models (LVLMs) have demonstrated impressive capabilities across a wide range of tasks.
However, their trustworthiness is often challenged by hallucinations.
We attribute this issue to modality misalignment and the inherent hallucinations of Large Language Models (LLMs), which serve as the "brain" of LVLMs.
Multimodal human preference alignment is a widely used approach for mitigating LVLM hallucinations. However, existing methods focus on response-level alignment while neglecting alignment at the image and instruction levels, leading to modality misalignment.
To address this, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves better modality alignment than existing human preference alignment methods.
In addition, to overcome the scarcity of high-quality multimodal preference data and help LVLMs mitigate hallucinations, we introduce a fine-grained multimodal preference data construction process that labels preferences at the entity level without requiring manual annotation.
Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, which reduces hallucination rates by 80.4% on Object HalBench and 52.6% on MM HalBench, thereby enhancing the trustworthiness of LVLMs.
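To make the abstract's notion of multi-level alignment concrete, the minimal sketch below shows one plausible way a DPO-style preference objective could be extended from response-level pairs to image- and instruction-level pairs. This is not the authors' implementation of EMPO: the functions `dpo_term` and `multimodal_preference_loss`, the three-level decomposition, and the per-level weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (Rafailov et al., 2023)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward)

def multimodal_preference_loss(pairs, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined objective: one DPO-style term per alignment level
    (response, image, instruction). `pairs` maps each level to the four
    log-probabilities required by `dpo_term`; the weights are illustrative."""
    loss = torch.zeros(())
    for w, level in zip(weights, ("response", "image", "instruction")):
        loss = loss + w * dpo_term(*pairs[level])
    return loss

# Illustrative usage with dummy log-probabilities.
pairs = {
    level: (torch.tensor(-5.0), torch.tensor(-7.0),   # policy: chosen, rejected
            torch.tensor(-6.0), torch.tensor(-6.5))   # reference: chosen, rejected
    for level in ("response", "image", "instruction")
}
print(multimodal_preference_loss(pairs))
```

In this sketch, image- and instruction-level pairs are treated the same way as response-level pairs, i.e. as preferred versus dispreferred conditioning inputs; how EMPO actually constructs and weights such pairs is described in the paper itself, not here.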
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8028