VISCON: Identifying and Benchmarking Vision Hallucination for Large Vision-Language Models

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: hallucination, vision hallucination, large vision-language model, large language model
Abstract: Large Vision-Language Models (LVLMs) have demonstrated exceptional capabilities in a variety of vision-language tasks, but suffer from "vision hallucinations": a tendency to generate text inconsistent with the image. This issue hampers their practical use in real-world applications. To effectively evaluate and detect these hallucinations, we introduce VISCON (VISual Concept cONsistency), a benchmark framework comprising a benchmark image dataset and quantitative evaluation pipelines to assess vision hallucinations in LVLMs. VISCON extends beyond previous hallucination metrics by offering: a) diverse image styles across multiple visual domains, b) evaluation of a broader range of visual concepts, including objects, attributes, and relationships, and c) high annotation density from detailed scene-graph annotations to reduce false negatives. These improvements enable comprehensive analysis of hallucinations related to both domain shifts and concept types, and offer more accurate hallucination evaluation. To detect vision hallucinations, we propose two innovative evaluation pipelines within VISCON: an Earth Mover's Distance (EMD)-based pipeline and an "Evaluate-By-Edit" pipeline. The EMD-based pipeline measures the distributional similarity between the reference visual concepts and those mentioned by LVLMs, and is robust against vocabulary shifts between annotations and natural-language responses. The "Evaluate-By-Edit" pipeline measures the edit distance between the original LVLM response and a hallucination-reduced version revised according to the rich visual concept annotations, providing an interpretable analysis of hallucinated content. Importantly, our method directly evaluates captioning responses, unlike previous metrics that query the existence of individual visual concepts. This approach is more challenging, as it requires models to handle multiple concepts simultaneously, and thus discriminates LVLM performance more sharply. Through extensive experiments on six leading LVLMs, VISCON reveals crucial insights into the nature of vision hallucinations. Our findings indicate that factors such as image domain shifts, complexity of visual concepts, and model response length significantly influence the occurrence of hallucinations in LVLM responses. Additionally, human evaluations confirm that VISCON aligns with human preferences better than established hallucination metrics.
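The abstract describes the two pipelines only at a high level. Below is a minimal, hypothetical sketch of the EMD scoring step, not the authors' released code: the embed() function is a placeholder for any word/sentence encoder (the real pipeline would use semantic embeddings so near-synonyms like "sofa" and "couch" transport cheaply, which is what makes the score robust to vocabulary shift), and the transport solve uses the POT library (pip install pot).

```python
# Hypothetical sketch of an EMD-based concept-consistency score.
# Assumptions (not from the paper): a placeholder embed() encoder,
# uniform weights over concepts, cosine ground cost, POT for the solve.
import hashlib
import numpy as np
import ot  # Python Optimal Transport


def embed(phrase: str) -> np.ndarray:
    """Placeholder encoder: deterministic random unit vector per phrase.
    Swap in a real embedding model for meaningful semantic distances."""
    seed = int(hashlib.md5(phrase.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=64)
    return v / np.linalg.norm(v)


def emd_concept_distance(reference: list[str], mentioned: list[str]) -> float:
    """EMD between reference concepts and concepts an LVLM mentioned.

    Each set is a uniform distribution over its concept embeddings; the
    ground cost is cosine distance, so near-synonyms are cheap to match."""
    R = np.stack([embed(c) for c in reference])
    H = np.stack([embed(c) for c in mentioned])
    cost = 1.0 - R @ H.T  # cosine distance (rows are unit-norm)
    a = np.full(len(reference), 1.0 / len(reference))
    b = np.full(len(mentioned), 1.0 / len(mentioned))
    return ot.emd2(a, b, cost)  # optimal transport cost = EMD


# Lower distance = the response's concepts better cover the annotations.
ref = ["dog", "red frisbee", "grass"]
hyp = ["puppy", "frisbee", "lawn", "tree"]  # "tree" may be hallucinated
print(emd_concept_distance(ref, hyp))
```

A companion sketch of the "Evaluate-By-Edit" idea follows. The revision step itself (rewriting the response against the scene-graph annotations to remove hallucinated content) is elided; only the scoring arithmetic, a normalized word-level edit cost, is shown, and the normalization choice here is illustrative rather than the paper's.

```python
# Hypothetical "Evaluate-By-Edit" scoring: word-level edit cost between
# the original response and a hallucination-reduced revision of it.
import difflib


def edit_based_score(original: str, revised: str) -> float:
    """Normalized edit cost in [0, 1]: 0 means nothing needed revising."""
    a, b = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - 2.0 * matched / (len(a) + len(b))  # 1 - similarity ratio


resp = "A dog chases a red frisbee next to a parked car"
fixed = "A dog chases a red frisbee on the grass"
print(edit_based_score(resp, fixed))  # larger = more content had to change
```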
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6675