Keywords: vision-language models; visual hallucination
Abstract: Vision-Language Models (VLMs) have achieved impressive progress across a range of multimodal tasks but remain highly susceptible to visual hallucination, producing text that contradicts the visual input. Existing mitigation strategies often rely on additional large-scale VLMs or multi-stage decoding, hindering efficiency and broad applicability. In this work, we identify redundant and noisy image features as a primary cause of hallucination: they degrade the model’s ability to capture semantically relevant visual content. Accordingly, we propose VIBRA (Vision-Language Information Bottleneck with Redundancy Awareness), a lightweight, plug-and-play module that adaptively filters out redundant visual information while preserving task-relevant semantics at both the token and feature levels. Specifically, VIBRA employs a multi-modal information bottleneck to retain image features aligned with the textual input, and introduces adaptive token filtering via spectral clustering and compression-aware pruning to eliminate instance-specific redundancy. Additionally, we design a Binary-Guided loss that sharpens the separation between informative and noisy features, enabling more effective gating of visual information. Extensive experiments demonstrate that VIBRA consistently improves visual reasoning and reduces hallucination across a variety of VLM architectures.
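To make the described pipeline concrete, below is a minimal, illustrative PyTorch sketch of how a VIBRA-style module *might* combine a text-conditioned bottleneck gate, spectral clustering of visual tokens, and a binarization penalty. Every name and design choice here (`VIBRAGate`, `spectral_clusters`, `binary_guided_penalty`, the cluster count, the keep ratio) is a hypothetical reading of the abstract, not the authors' implementation.

```python
# Hypothetical sketch of a VIBRA-style gate; assumes PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


def spectral_clusters(tokens: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Cluster N tokens via the similarity graph's low-frequency eigenvectors,
    refined with a small k-means loop. Returns an (N,) tensor of cluster ids."""
    with torch.no_grad():
        z = F.normalize(tokens, dim=-1)
        adj = (z @ z.T).clamp(min=0)             # non-negative similarity graph
        lap = torch.diag(adj.sum(-1)) - adj      # unnormalized graph Laplacian
        _, evecs = torch.linalg.eigh(lap)        # eigenvectors, ascending order
        emb = F.normalize(evecs[:, :k], dim=-1)  # spectral embedding of tokens
        centers = emb[torch.randperm(emb.size(0))[:k]].clone()
        for _ in range(iters):                   # plain k-means refinement
            ids = torch.cdist(emb, centers).argmin(-1)
            for c in range(k):
                mask = ids == c
                if mask.any():
                    centers[c] = emb[mask].mean(0)
        return ids


def binary_guided_penalty(p: torch.Tensor) -> torch.Tensor:
    # One plausible reading of a "Binary-Guided" objective: push gate
    # probabilities toward {0, 1} so informative and noisy tokens separate.
    return (p * (1.0 - p)).mean()


class VIBRAGate(nn.Module):
    """Text-conditioned bottleneck gate plus cluster-aware token pruning
    (an assumed composition of the components named in the abstract)."""

    def __init__(self, dim: int, n_clusters: int = 8, keep_ratio: float = 0.5):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                  nn.Linear(dim, 1))
        self.n_clusters, self.keep_ratio = n_clusters, keep_ratio

    def forward(self, visual: torch.Tensor, text: torch.Tensor):
        # visual: (N, d) image tokens; text: (T, d) prompt tokens
        ctx = text.mean(0, keepdim=True).expand(visual.size(0), -1)
        p = torch.sigmoid(self.gate(torch.cat([visual, ctx], -1))).squeeze(-1)

        ids = spectral_clusters(visual, self.n_clusters)
        k = max(1, int(self.keep_ratio * visual.size(0)))
        # Rank tokens by gate probability, but always keep each cluster's
        # best token so no semantic region is dropped entirely
        # (empty clusters harmlessly fall back to token 0 here).
        best = torch.stack([p.masked_fill(ids != c, -1.0).argmax()
                            for c in range(self.n_clusters)])
        ranked = p.argsort(descending=True)[:k]
        keep = torch.unique(torch.cat([best, ranked]))
        return visual[keep] * p[keep].unsqueeze(-1), p
```

In this sketch the gate is soft (tokens are rescaled by their keep probability) while pruning is hard (top-k plus one representative per spectral cluster); `binary_guided_penalty(p)` would be added to the training loss to drive the two regimes apart. Whether VIBRA uses this exact combination is not stated in the abstract.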
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6822