VidHal: Benchmarking Hallucinations in Vision LLMs

10 May 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY 4.0
Keywords: hallucination, large language models, vision language models
Abstract: Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with little exploration of their video-based counterparts. Furthermore, current evaluation methods fail to capture the nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address these two limitations, we introduce \textsc{VidHal}, a benchmark specifically designed to evaluate video-based hallucinations in VLLMs. \textsc{VidHal} is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark is the careful creation of captions representing varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task that requires VLLMs to rank captions by their hallucinatory extent. We conduct extensive experiments on \textsc{VidHal}, comprehensively evaluating a broad selection of models, including both open-source and proprietary ones such as GPT-4o. Our results uncover significant limitations in existing VLLMs with respect to hallucination on video inputs. Through our benchmark, we aim to inspire further research on (i) holistic understanding of VLLM capabilities, particularly regarding hallucination, and (ii) advancing VLLMs to alleviate this problem.
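
The caption ordering task described above lends itself to a simple pairwise scoring rule: count how many caption pairs a model orders consistently with the reference ranking. The Python sketch below is a minimal illustration under our own assumptions; the function name, input format, and exact metric are ours and may differ from the scoring actually used in VidHal.

```python
from itertools import combinations

def pairwise_ordering_accuracy(predicted, reference):
    """Fraction of caption pairs whose relative order matches the reference.

    `predicted` and `reference` are lists of caption IDs ordered from least
    to most hallucinatory. Hypothetical metric: VidHal's exact scoring rule
    may differ.
    """
    ref_rank = {c: i for i, c in enumerate(reference)}
    pred_rank = {c: i for i, c in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    correct = sum(
        (pred_rank[a] < pred_rank[b]) == (ref_rank[a] < ref_rank[b])
        for a, b in pairs
    )
    return correct / len(pairs)

# Example: captions A (accurate), B (mildly hallucinated), C (severely hallucinated).
# The model swaps B and C, so 2 of 3 pairs are ordered correctly.
print(pairwise_ordering_accuracy(["A", "C", "B"], ["A", "B", "C"]))  # ~0.667
```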
Croissant File: json
Dataset URL: https://huggingface.co/datasets/anon-axolotl/VidHal
Code URL: https://github.com/anon-axolotl/VidHal
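
Assuming the dataset is published in a layout readable by the Hugging Face `datasets` library (the Croissant metadata above suggests a standard schema), a minimal loading sketch might look like the following; the split name and record fields are assumptions, not confirmed by this page.

```python
from datasets import load_dataset  # pip install datasets

# Hypothetical usage sketch: the split name ("test") and the record fields
# are assumptions; consult the repository README for the actual layout.
vidhal = load_dataset("anon-axolotl/VidHal")
print(vidhal)                # inspect the available splits
example = vidhal["test"][0]  # assumed split name
print(example.keys())        # e.g., video reference and candidate captions
```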
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 1234