VCGD: Visual Clue-Guided Decoding with a Caption Model for Mitigating Hallucination in Multimodal Large Language Models

ACL ARR 2025 May Submission4525 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Multimodal large language models (MLLMs) demonstrate strong capabilities in multimodal understanding, reasoning, and interaction, but still suffer from the fundamental limitation of hallucinations, where they generate erroneous or fabricated information. We propose **V**isual **C**lue-**G**uided **D**ecoding (**VCGD**), a novel decoding strategy that incorporates precise visual clues generated by a Caption Model during the decoding phase. These clues serve as comparative references for the model’s own outputs, effectively mitigating hallucinations. Specifically, VCGD leverages high-quality visual descriptions to guide MLLMs in correcting perceptual biases while generating answers. Furthermore, we introduce a Reinforcement Learning (RL)-based training paradigm for the Caption Model, in which a Reward Agent provides feedback on the quality of visual clues, further enhancing the accuracy of the visual information. Extensive experiments across multiple benchmark datasets and state-of-the-art MLLMs demonstrate that VCGD significantly reduces hallucination rates and substantially improves cross-modal consistency. Our method exhibits strong generalizability and scalability, offering an effective decoding enhancement strategy that can be seamlessly integrated into existing multimodal frameworks. Code is available at https://anonymous.4open.science/r/VCGD-C860.
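The abstract describes the decoding strategy only at a high level. As an illustration, the sketch below shows one way caption-model clues could be injected during decoding; it is not the authors' implementation (see the linked repository for that), and the interfaces `mllm.next_token_logits`, `mllm.tokenize`, `caption_model.describe`, and the guidance weight `alpha` are all hypothetical placeholders.

```python
# Hypothetical sketch of clue-guided decoding; all interfaces are illustrative.
import torch

def vcgd_decode(mllm, caption_model, image, question, alpha=1.0, max_new_tokens=64):
    """Greedy decoding sketch: the caption model's description of the image is
    prepended to the prompt as a visual clue, and the clue-conditioned logits are
    contrasted with the plain logits to steer generation toward the image content."""
    clue = caption_model.describe(image)  # high-quality visual description of the image
    base_prompt = mllm.tokenize(question)
    clue_prompt = mllm.tokenize(f"Visual clue: {clue}\n{question}")
    output = []
    for _ in range(max_new_tokens):
        base_logits = mllm.next_token_logits(image, base_prompt + output)
        clue_logits = mllm.next_token_logits(image, clue_prompt + output)
        # Up-weight tokens that the visual clue supports relative to the plain prompt.
        guided = clue_logits + alpha * (clue_logits - base_logits)
        next_tok = int(torch.argmax(guided))
        if next_tok == mllm.eos_token_id:
            break
        output.append(next_tok)
    return mllm.detokenize(output)
```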
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal large language models, hallucinations, visual clue-guided decoding
Languages Studied: English
Submission Number: 4525