Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

Published: 01 May 2025, Last Modified: 18 Jun 2025. ICML 2025 Spotlight Poster. License: CC BY 4.0
Abstract: The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate objects that do not exist in the input images. To address this issue, prior work has focused on using specially curated datasets or powerful LLMs to rectify the outputs of LVLMs. However, these approaches require either costly training or fine-tuning, or API access to proprietary LLMs for post-generation correction. In response to these limitations, we propose Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE), a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it consistently reduces hallucinations in GPT-4V-assisted evaluation while preserving the level of detail in LVLMs' generations. We release our code at https://github.com/Linxi-ZHAO/MARINE.
Lay Summary: Imagine asking an AI to describe a photo—and it confidently tells you there’s a dog on the beach… except there’s no dog at all. These hallucinations, where AI models describe objects that don’t actually exist in the image, are surprisingly common in today’s powerful vision-language models (LVLMs). We introduce MARINE, a lightweight and practical fix to this problem. Instead of relying on expensive retraining or proprietary commercial tools, MARINE takes a smarter approach: it uses open-source vision models to help LVLMs “see” what’s really in the image before responding. Think of it as giving the AI a grounded reality check. MARINE is plug-and-play—it works out of the box with many existing models, requires no retraining, and significantly reduces hallucinations without sacrificing the quality or detail of the answers.
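The idea of steering generation with image-grounded guidance can be sketched as a classifier-free-guidance-style blend of two decoding passes: one conditioned only on the prompt, and one additionally conditioned on the objects an open-source detector found in the image. The sketch below is illustrative only; the function name, the `gamma` value, and the toy logits are assumptions, not the paper's actual implementation.

```python
import numpy as np

def guided_logits(cond_logits, uncond_logits, gamma=1.5):
    """Blend two sets of next-token logits, steering toward the
    object-grounded (conditioned) distribution.
    gamma=1 recovers the conditioned logits; gamma=0 ignores the
    grounding entirely. (Hypothetical sketch, not MARINE's code.)"""
    cond = np.asarray(cond_logits, dtype=float)
    uncond = np.asarray(uncond_logits, dtype=float)
    return uncond + gamma * (cond - uncond)

# Toy next-token logits over a 3-word vocabulary from two forward
# passes of the same LVLM:
uncond = np.array([2.0, 1.0, 0.5])   # prompt only
cond   = np.array([1.0, 2.5, 0.5])   # prompt + detected-object list
blended = guided_logits(cond, uncond, gamma=1.5)
# With gamma > 1, tokens favored by the grounded pass are amplified
# relative to the ungrounded pass before sampling.
```

Because the guidance acts purely on logits at inference time, it needs no gradient updates to the LVLM, which is what makes the approach training-free.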
Link To Code: https://github.com/Linxi-ZHAO/MARINE
Primary Area: Deep Learning->Large Language Models
Keywords: Large Vision-Language Models, Object Hallucination, Multi-modal LLMs
Submission Number: 14554