SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
TL;DR: We propose SECOND (Selective and Contrastive Decoding) to tackle perceptual hallucination in LVLMs by iteratively selecting and contrasting multi-scale visual information.
Abstract: Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and outperforms prior methods across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale processing in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
Lay Summary: Vision-Language Models (VLMs), which combine image understanding with language generation, have made impressive progress in tasks such as image captioning and visual question answering. However, they often suffer from perceptual hallucination — generating descriptions that either overlook objects clearly present in the image or mention objects that are not there at all. This leads to confusing or incomplete answers and limits their trustworthiness in critical applications.
We developed SECOND: Selective and Contrastive Decoding, a new method that helps these models better focus on what’s actually visible in an image. Unlike other approaches that treat all image areas equally, SECOND imitates how humans look at images: starting with a rough overview, then zooming in on important parts. SECOND gradually filters and refines visual details through multiple stages, helping the model ignore irrelevant parts of the image and focus on meaningful ones.
SECOND works without any extra training and is compatible with existing models. It also includes a novel comparison step, where outputs at different stages are contrasted to reduce errors. Across multiple benchmarks, our method significantly reduced hallucination and improved object recognition.
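The comparison step described above can be sketched in code. The snippet below is an illustrative contrastive-decoding rule, not the paper's exact formulation: next-token logits from a fine-scale (later-stage) pass are amplified relative to a coarse-scale (earlier-stage) pass, restricted to tokens the fine-scale pass already finds plausible. The function name, `alpha`/`beta` parameters, and the toy vocabulary are all hypothetical.

```python
import numpy as np

def contrastive_logits(logits_fine, logits_coarse, alpha=1.0, beta=0.1):
    """Hedged sketch of a contrastive-decoding step (not SECOND's exact rule):
    boost tokens the fine-scale pass favors over the coarse-scale pass,
    keeping only tokens the fine pass already considers plausible."""
    # Softmax over the fine-scale logits to get a plausibility distribution.
    probs_fine = np.exp(logits_fine - logits_fine.max())
    probs_fine /= probs_fine.sum()
    # Plausibility mask: keep tokens within a beta fraction of the top probability.
    mask = probs_fine >= beta * probs_fine.max()
    # Contrast: emphasize what the fine pass adds beyond the coarse pass.
    contrast = (1 + alpha) * logits_fine - alpha * logits_coarse
    contrast[~mask] = -np.inf
    return contrast

# Toy 4-token vocabulary: the coarse pass over-scores token 3 (a "hallucinated" object).
fine = np.array([2.0, 0.5, 0.1, 1.8])
coarse = np.array([1.0, 0.4, 0.1, 2.5])
best = int(np.argmax(contrastive_logits(fine, coarse)))
print(best)  # the fine-pass favorite wins; the coarse-only token is suppressed
```

The design choice here follows the general contrastive-decoding recipe: the subtraction penalizes tokens that only the coarser (less reliable) view supports, while the plausibility mask prevents the contrast from promoting tokens that were never likely in the first place.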
By making AI models see more like humans, SECOND moves us one step closer to trustworthy and accurate visual reasoning.
Primary Area: Applications->Computer Vision
Keywords: Vision-Language Model, Hallucination, Selective and Contrastive Decoding
Submission Number: 3212