Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models
Abstract: Large Vision-Language Models (LVLMs) exhibit remarkable capabilities but struggle with "hallucinations", i.e., inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have examined hallucinations involving objects, attributes, and relations, but have overlooked more complex hallucinations that construct an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data covering these hallucination types, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging their efficacy in handling hallucinations. We will release our code and data.
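To make the combined discriminative/generative setup concrete, the sketch below illustrates one way a discriminative probe could be built from paired faithful and hallucinated captions and scored per hallucination type against an LVLM's yes/no answers. This is a minimal sketch under stated assumptions: the `Probe` fields, the `ask_lvlm` callable, and the category names are hypothetical placeholders for illustration, not the framework's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical category labels mirroring the refined taxonomy described above.
HALLUCINATION_TYPES = ("object", "attribute", "relation", "event")

@dataclass
class Probe:
    image_path: str          # image under evaluation
    caption: str             # caption shown to the LVLM
    is_hallucinated: bool    # whether the caption contains an injected hallucination
    hallucination_type: str  # one of HALLUCINATION_TYPES, or "none" if faithful

def discriminative_accuracy(
    probes: List[Probe],
    ask_lvlm: Callable[[str, str], str],  # (image_path, question) -> free-form answer
) -> Dict[str, float]:
    """Score yes/no faithfulness judgments per hallucination type (illustrative only)."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for p in probes:
        question = (
            f'Does this caption faithfully describe the image? "{p.caption}" '
            "Answer yes or no."
        )
        answer = ask_lvlm(p.image_path, question).strip().lower()
        predicted_faithful = answer.startswith("yes")
        key = p.hallucination_type
        total[key] = total.get(key, 0) + 1
        correct[key] = correct.get(key, 0) + int(predicted_faithful == (not p.is_hallucinated))
    return {k: correct[k] / total[k] for k in total}
```

A generative counterpart would instead collect the LVLM's own descriptions of each image and check them against reference annotations; the per-type breakdown above is simply one way to surface where event hallucinations diverge from object, attribute, and relation cases.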
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Generation] Multimedia Foundation Models, [Content] Vision and Language
Relevance To Conference: This work contributes to multimedia/multimodal processing by addressing a critical challenge in Large Vision-Language Models (LVLMs): hallucinations, i.e., inconsistencies between images and their descriptions. By introducing a refined taxonomy of hallucinations that includes the new category of Event Hallucination, our study provides a deeper understanding of LVLMs' performance across a more diverse array of hallucination types. In particular, our focus on event hallucinations, a complex form that constructs narratives around fictional entities, enriches the field's approach to evaluating and mitigating discrepancies in multimodal systems. Using advanced Large Language Models (LLMs) to generate and filter fine-grained hallucinatory data, we lay the groundwork for integrating both discriminative and generative evaluation methods. This integration within our universal evaluation framework offers a nuanced, comprehensive tool for assessing LVLMs' efficacy in handling multimodal hallucinations. Releasing our code and data provides resources for ongoing research in multimedia/multimodal processing and facilitates the development of more coherent, accurate, and reliable LVLMs that can better understand and generate human-like, contextually accurate multimodal content.
Supplementary Material: zip
Submission Number: 1093