Keywords: Language and Vision, Hallucination
TL;DR: A training-free, model-editing framework that mitigates hallucination in large vision-language models through dynamic, per-instance subspace suppression at test time.
Abstract: Recent advances in large vision-language models (LVLMs) have enabled powerful multimodal reasoning by integrating visual encoders with large language models (LLMs). However, their reliability is frequently undermined by hallucinations: generated text that inaccurately describes the visual input. Although fine-tuning can mitigate this, it is computationally expensive and demands large, curated datasets, making training-free alternatives more appealing. Among training-free strategies, model-editing is more promising than decoding-based approaches. While decoding methods can adapt outputs per input, they introduce substantial computational overhead and instability. Model-editing, by contrast, modifies the model's internal representations offline, offering a more efficient and stable framework. However, the effectiveness of current model-editing techniques is limited: existing methods typically rely on a single, global subspace to correct errors. This static, one-size-fits-all approach treats all test samples identically and fails to capture the diverse modes of hallucination that vary from one input to another. To overcome this limitation, we propose a training-free hallucination mitigation framework that performs dynamic, per-instance suppression at test time. Our method advances the model-editing paradigm by first constructing a set of Disentangled Hallucination Subspaces, where each subspace isolates a distinct hallucination mode. At inference, our method adaptively computes weights that capture how a given input relates to each subspace. These weights guide a dynamically combined projection that selectively suppresses the most probable hallucination directions for that instance while preserving image-grounded semantics. Extensive experiments across multiple vision-language benchmarks and LVLM families demonstrate consistent improvements, highlighting the robustness, generalizability, and efficiency of our approach.
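The abstract describes the inference-time edit only at a high level. The sketch below is a minimal illustration, not the paper's actual implementation: it assumes each Disentangled Hallucination Subspace is represented by a matrix with orthonormal columns, and that per-instance weights come from a softmax over the hidden state's projection energy in each subspace. All names (`combined_suppression`, `bases`, `temperature`) are hypothetical.

```python
import numpy as np

def combined_suppression(h, bases, temperature=1.0):
    """Illustrative per-instance suppression of hallucination directions.

    h      : (d,) hidden representation of the current input.
    bases  : list of (d, r_k) matrices, each with orthonormal columns
             spanning one assumed hallucination subspace.
    Returns the edited representation with a weighted combination of
    subspace projections subtracted.
    """
    # Energy of h inside each subspace: ||B_k^T h||^2.
    energies = np.array([np.sum((B.T @ h) ** 2) for B in bases])

    # Softmax over energies -> per-instance weights (one plausible choice).
    logits = energies / temperature
    logits -= logits.max()  # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    # Subtract the weighted projections: h' = h - sum_k w_k * B_k B_k^T h.
    h_edited = h.copy()
    for w, B in zip(weights, bases):
        h_edited -= w * (B @ (B.T @ h))
    return h_edited


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 16
    # Two toy "hallucination" subspaces with orthonormal bases (via QR).
    bases = [np.linalg.qr(rng.standard_normal((d, 3)))[0] for _ in range(2)]
    h = rng.standard_normal(d)
    print(np.linalg.norm(h), np.linalg.norm(combined_suppression(h, bases)))
```

Energy-based softmax weighting is only one plausible way to relate an input to each subspace; the point of the sketch is that h' = h - sum_k w_k * B_k B_k^T h removes a per-instance weighted combination of hallucination directions rather than projecting out a single fixed global subspace.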
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6379