Keywords: Hallucination Mitigation, Natural Language Generation, Large Vision-Language Models, Decoding Strategies
TL;DR: We revisit hallucination arising from visual encoding in Large Vision-Language Models and propose a multi-scale visual alignment decoding framework.
Abstract: Large Vision-Language Models (LVLMs) face the persistent challenge of object hallucination, in which models describe objects that are not present in the input image. In this work, we revisit hallucination arising from visual encoding and show that, although moderate resolution scaling alleviates hallucination, excessive visual tokens diffuse attention and reintroduce it, mirroring the context-induced effect previously observed in text generation. To address this, we propose a multi-scale visual alignment decoding framework that supplements visual knowledge at multiple granularities while keeping attention focused on the correct regions, thereby mitigating context-driven hallucination. Extensive experiments demonstrate that our approach substantially reduces object hallucination and achieves stronger image-text alignment than state-of-the-art methods.
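To make the decoding idea concrete, the following is a minimal, hypothetical sketch, not the paper's actual algorithm: it fuses next-token logits obtained from visual features at several scales and down-weights scales whose attention over visual tokens is diffuse. All function names and the entropy-based weighting scheme are illustrative assumptions introduced here.

```python
# Hypothetical illustration of multi-scale logit fusion for decoding.
# Assumes that, for each visual scale, we already have the LVLM's
# next-token logits and its text-to-image attention distribution.
import numpy as np


def attention_focus(attn_weights: np.ndarray) -> float:
    """Score how concentrated the attention over visual tokens is.

    Uses negative entropy (higher = more focused). `attn_weights` is a
    1-D distribution over visual tokens.
    """
    p = attn_weights / attn_weights.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return -entropy


def fuse_multiscale_logits(per_scale_logits, per_scale_attn, temperature=1.0):
    """Fuse next-token logits from several visual scales.

    Scales with more focused attention over visual tokens receive larger
    weights, so diffuse, hallucination-prone scales contribute less.
    """
    focus = np.array([attention_focus(a) for a in per_scale_attn])
    weights = np.exp(focus / temperature)
    weights = weights / weights.sum()
    logits = np.stack(per_scale_logits, axis=0)       # (num_scales, vocab)
    return (weights[:, None] * logits).sum(axis=0)    # (vocab,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, n_tokens = 8, 16
    # Two scales: one with sharp attention, one with diffuse attention.
    logits_low = rng.normal(size=vocab)
    logits_high = rng.normal(size=vocab)
    attn_sharp = np.eye(n_tokens)[3] * 0.9 + 0.1 / n_tokens
    attn_diffuse = np.full(n_tokens, 1.0 / n_tokens)
    fused = fuse_multiscale_logits([logits_low, logits_high],
                                   [attn_sharp, attn_diffuse])
    print("fused next-token logits:", np.round(fused, 3))
```

In this toy example, the scale with sharp attention dominates the fused logits, which is one simple way to keep decoding anchored to the visual regions the model actually attends to.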
Primary Area: generative models
Submission Number: 16963