Abstract: Large Vision-Language Models (LVLMs) suffer from object hallucination, where models generate descriptions of objects that are not present in the image. This issue primarily arises from the failure of the visual encoder to attend to detailed regions and the tendency of the language model to favor contextual plausibility over visual evidence during generation. In this work, we propose a dual-perspective decoding framework that jointly optimizes text generation from both visual and textual views to address hallucinations caused by image-text misalignment. From the textual perspective, our framework aligns generated text with visual content at both the sentence and word levels; from the visual perspective, it simultaneously ensures that visual objects are aligned with their corresponding textual semantics. Extensive experiments demonstrate that our method significantly reduces object hallucination and achieves superior image-text alignment compared to existing state-of-the-art methods. Notably, it improves on previous approaches by 7.5% to 19.2% under the CHAIR evaluation metrics, highlighting its effectiveness in enhancing the visual faithfulness of generation.
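The abstract does not specify how the two perspectives are combined at decoding time; purely as an illustrative sketch (not the authors' method), the snippet below shows one generic way a decoder could re-rank candidate tokens by mixing the language-model distribution with a textual-perspective and a visual-perspective alignment score. All names, the combination weights alpha and beta, and the toy scores are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def dual_perspective_step(lm_logits, txt_align, vis_align, alpha=0.5, beta=0.5):
    """Re-rank next-token candidates by combining the LM distribution with
    textual-perspective (text-to-image) and visual-perspective (object-to-text)
    alignment scores. All inputs are per-candidate arrays of equal length.
    This is a hypothetical combination rule, not the paper's actual objective."""
    combined = np.log(softmax(lm_logits) + 1e-12) + alpha * txt_align + beta * vis_align
    return int(np.argmax(combined)), softmax(combined)

# Toy usage with made-up scores for 4 candidate tokens.
lm_logits = np.array([2.0, 1.5, 0.3, -1.0])   # LM alone favors candidate 0
txt_align = np.array([-0.2, 0.4, 0.1, 0.0])   # sentence/word-level text-image alignment
vis_align = np.array([-0.5, 0.6, 0.0, -0.1])  # object-level visual-to-text alignment
best, probs = dual_perspective_step(lm_logits, txt_align, vis_align)
print(best, probs.round(3))  # alignment evidence shifts the choice to candidate 1
```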
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Hallucination Mitigation, Natural Language Generation, Large Vision-Language Models, Decoding Strategies
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Keywords: Natural Language Generation, Hallucination Mitigation, Vision-Language Models, Image Captioning
Submission Number: 1508