Keywords: Hallucination Mitigation, Natural Language Generation, Large Vision-Language Models, Decoding Strategies
TL;DR: We revisit hallucination arising from visual encoding in Large Vision-Language Models and propose a multi-scale visual alignment decoding framework.
Abstract: Large Vision-Language Models (LVLMs) face the persistent challenge of object hallucination, in which models describe objects that are not present in the input image. In this work, we revisit hallucination arising from visual encoding and show that, although moderate resolution scaling alleviates hallucination, excessive visual tokens diffuse attention and reintroduce it, mirroring the context-induced effect previously observed in text generation. To address this, we propose a multi-scale visual alignment decoding framework that supplements visual knowledge at multiple granularities while keeping attention focused on the correct regions, thereby mitigating context-driven hallucination. Extensive experiments demonstrate that our approach substantially reduces object hallucination and achieves stronger image-text alignment than state-of-the-art methods.
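To make the decoding idea concrete, the following is a minimal, hypothetical sketch, not the paper's actual algorithm: it fuses next-token logits obtained from visual features at several scales and down-weights scales whose attention over visual tokens is diffuse. All function names and the entropy-based weighting scheme are illustrative assumptions introduced here.

```python
# Hypothetical illustration of multi-scale logit fusion for decoding.
# Assumes that, for each visual scale, we already have the LVLM's
# next-token logits and its text-to-image attention distribution.
import numpy as np


def attention_focus(attn_weights: np.ndarray) -> float:
    """Score how concentrated the attention over visual tokens is.

    Uses negative entropy (higher = more focused). `attn_weights` is a
    1-D distribution over visual tokens.
    """
    p = attn_weights / attn_weights.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    return -entropy


def fuse_multiscale_logits(per_scale_logits, per_scale_attn, temperature=1.0):
    """Fuse next-token logits from several visual scales.

    Scales with more focused attention over visual tokens receive larger
    weights, so diffuse, hallucination-prone scales contribute less.
    """
    focus = np.array([attention_focus(a) for a in per_scale_attn])
    weights = np.exp(focus / temperature)
    weights = weights / weights.sum()
    logits = np.stack(per_scale_logits, axis=0)       # (num_scales, vocab)
    return (weights[:, None] * logits).sum(axis=0)    # (vocab,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, n_tokens = 8, 16
    # Two scales: one with sharp attention, one with diffuse attention.
    logits_low = rng.normal(size=vocab)
    logits_high = rng.normal(size=vocab)
    attn_sharp = np.eye(n_tokens)[3] * 0.9 + 0.1 / n_tokens
    attn_diffuse = np.full(n_tokens, 1.0 / n_tokens)
    fused = fuse_multiscale_logits([logits_low, logits_high],
                                   [attn_sharp, attn_diffuse])
    print("fused next-token logits:", np.round(fused, 3))
```

In this toy example, the scale with sharp attention dominates the fused logits, which is one simple way to keep decoding anchored to the visual regions the model actually attends to.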
Primary Area: generative models
Submission Number: 16963