VidHal: Benchmarking Hallucinations in Vision LLMs

10 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: hallucination, large language models, vision language models
Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with little exploration of their video-based counterparts. Furthermore, current evaluation methods fail to capture the nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address these two limitations, we introduce \textsc{VidHal}, a benchmark specifically designed to evaluate video-based hallucinations in VLLMs. \textsc{VidHal} is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark is the careful creation of captions representing varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task that requires VLLMs to rank captions by their hallucinatory extent. We conduct extensive experiments on \textsc{VidHal}, comprehensively evaluating a broad selection of models, including both open-source and proprietary ones such as GPT-4o. Our results uncover significant limitations in existing VLLMs with respect to hallucination on video inputs. Through our benchmark, we aim to inspire further research on (i) holistic understanding of VLLM capabilities, particularly regarding hallucination, and (ii) advancing VLLMs to alleviate this problem.
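
The caption ordering task described above lends itself to a simple pairwise scoring rule: count how many caption pairs a model orders consistently with the reference ranking. The Python sketch below is a minimal illustration under our own assumptions; the function name, input format, and exact metric are ours and may differ from the scoring actually used in VidHal.

```python
from itertools import combinations

def pairwise_ordering_accuracy(predicted, reference):
    """Fraction of caption pairs whose relative order matches the reference.

    `predicted` and `reference` are lists of caption IDs ordered from least
    to most hallucinatory. Hypothetical metric: VidHal's exact scoring rule
    may differ.
    """
    ref_rank = {c: i for i, c in enumerate(reference)}
    pred_rank = {c: i for i, c in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    correct = sum(
        (pred_rank[a] < pred_rank[b]) == (ref_rank[a] < ref_rank[b])
        for a, b in pairs
    )
    return correct / len(pairs)

# Example: captions A (accurate), B (mildly hallucinated), C (severely hallucinated).
# The model swaps B and C, so 2 of 3 pairs are ordered correctly.
print(pairwise_ordering_accuracy(["A", "C", "B"], ["A", "B", "C"]))  # ~0.667
```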
Croissant File: json
Dataset URL: https://huggingface.co/datasets/anon-axolotl/VidHal
Code URL: https://github.com/anon-axolotl/VidHal
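
Assuming the dataset is published in a layout readable by the Hugging Face `datasets` library (the Croissant metadata above suggests a standard schema), a minimal loading sketch might look like the following; the split name and record fields are assumptions, not confirmed by this page.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical usage sketch: the split name ("test") and the record fields
# are assumptions; consult the repository README for the actual layout.
vidhal = load_dataset("anon-axolotl/VidHal")
print(vidhal)                # inspect the available splits
example = vidhal["test"][0]  # assumed split name
print(example.keys())        # e.g., video reference and candidate captions
```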
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 1234