Abstract: Large Vision-Language Models (LVLMs) suffer from object hallucination, where models generate descriptions of objects that are not present in the image. This issue primarily arises from the failure of the visual encoder to attend to detailed regions and the tendency of the language model to favor contextual plausibility over visual evidence during generation. In this work, we propose a dual-perspective decoding framework that jointly optimizes text generation from both visual and textual views to address hallucinations caused by image-text misalignment. From the textual perspective, our framework aligns generated text with visual content at both the sentence and word levels; from the visual perspective, it simultaneously ensures that visual objects are aligned with their corresponding textual semantics. Extensive experiments demonstrate that our method significantly reduces object hallucination and achieves superior image-text alignment compared to existing state-of-the-art methods. Notably, it improves on previous approaches by 7.5% to 19.2% under the CHAIR evaluation metrics, highlighting its effectiveness in enhancing the visual faithfulness of generation.
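The abstract does not specify how the two perspectives are combined at decoding time; purely as an illustrative sketch (not the authors' method), the snippet below shows one generic way a decoder could re-rank candidate tokens by mixing the language-model distribution with a textual-perspective and a visual-perspective alignment score. All names, the combination weights alpha and beta, and the toy scores are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def dual_perspective_step(lm_logits, txt_align, vis_align, alpha=0.5, beta=0.5):
    """Re-rank next-token candidates by combining the LM distribution with
    textual-perspective (text-to-image) and visual-perspective (object-to-text)
    alignment scores. All inputs are per-candidate arrays of equal length.
    This is a hypothetical combination rule, not the paper's actual objective."""
    combined = np.log(softmax(lm_logits) + 1e-12) + alpha * txt_align + beta * vis_align
    return int(np.argmax(combined)), softmax(combined)

# Toy usage with made-up scores for 4 candidate tokens.
lm_logits = np.array([2.0, 1.5, 0.3, -1.0])   # LM alone favors candidate 0
txt_align = np.array([-0.2, 0.4, 0.1, 0.0])   # sentence/word-level text-image alignment
vis_align = np.array([-0.5, 0.6, 0.0, -0.1])  # object-level visual-to-text alignment
best, probs = dual_perspective_step(lm_logits, txt_align, vis_align)
print(best, probs.round(3))  # alignment evidence shifts the choice to candidate 1
```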
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Hallucination Mitigation, Natural Language Generation, Large Vision-Language Models, Decoding Strategies
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Keywords: Natural Language Generation, Hallucination Mitigation, Vision-Language Models, Image Captioning
Submission Number: 1508