HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

Published: 01 Jan 2024, Last Modified: 13 Dec 2024, CVPR 2024, CC BY-SA 4.0
Abstract: We introduce "HallusionBench" ("Hallusion" is a portmanteau of "hallucination" and "illusion"), a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens the understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.