HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Abstract: We introduce “HallusionBench” (“Hallusion” is a portmanteau of “hallucination” and “illusion”), a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large vision-language models (LVLMs), such as GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models’ response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 15 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens the understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.
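The question-pair accuracy reported above is a stricter criterion than per-question accuracy: under the benchmark's control-group design, a pair of related questions is credited only when the model answers every question in the pair correctly, which penalizes inconsistent responses. Below is a minimal sketch of such a metric, assuming each evaluation record carries a pair identifier and a per-question correctness flag; the field names (`pair_id`, `correct`) are illustrative, not taken from the paper's codebase.

```python
from collections import defaultdict

def question_pair_accuracy(records):
    """Fraction of question pairs in which every question is answered correctly.

    records: iterable of dicts, each with a pair identifier ("pair_id")
    and a boolean per-question correctness flag ("correct").
    """
    pairs = defaultdict(list)
    for r in records:
        pairs[r["pair_id"]].append(r["correct"])
    # A pair scores 1 only if all of its questions are correct.
    return sum(all(flags) for flags in pairs.values()) / len(pairs)

# Example: two pairs; only the first is answered fully correctly.
records = [
    {"pair_id": 0, "correct": True},  {"pair_id": 0, "correct": True},
    {"pair_id": 1, "correct": True},  {"pair_id": 1, "correct": False},
]
print(question_pair_accuracy(records))  # 0.5
```

Scoring at the pair level rather than the question level is what exposes the consistency failures the abstract describes: a model that answers the original question correctly but flips on its control counterpart earns no credit for that pair.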