VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Vision-language Model, Video Understanding, Video Generation, Hallucination, Diffusion Models Monitoring
TL;DR: We present VideoHallu, a benchmark of over 3,000 synthetic videos with expert-crafted counterintuitive QA pairs, evaluating MLLMs' ability to detect perceptually obvious abnormalities often missed due to language priors.
Abstract: Vision-language models (VLMs) have achieved remarkable success in video understanding tasks. Yet a key question remains: do they comprehend visual information, or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positive-control tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp the underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations. A model with true visual understanding should perform comparably on both positive and negative tests. Since such content is rare in the real world, we introduce VideoHallu, a synthetic video dataset featuring physics- and commonsense-violating scenes generated using state-of-the-art tools such as Veo2, Sora, and Kling. The dataset includes expert-annotated question-answer pairs spanning four categories of physical and commonsense violations, designed to be straightforward for human reasoning. We evaluate several leading VLMs, including Qwen-2.5-VL, Video-R1, and VideoChat-R1. Despite their strong performance on real-world benchmarks (e.g., MVBench, MMVU), these models hallucinate or fail to detect physical or logical violations, revealing fundamental weaknesses in visual understanding. Finally, we explore reinforcement learning-based post-training on our negative dataset: fine-tuning improves performance on VideoHallu without degrading results on standard benchmarks, indicating enhanced visual reasoning in VLMs. Our data is available at https://github.com/zli12321/VideoHallu.git.
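Below is a minimal sketch of the negative-control evaluation loop the abstract describes: pose expert-crafted counterintuitive questions about a synthetic video and score the model's answers. The QA field names, example videos, and the `ask_vlm`/`judge` hooks are illustrative assumptions, not the benchmark's actual API or data schema.

```python
# Hypothetical negative-control evaluation sketch (not the VideoHallu release code).
from typing import Callable, Dict, List

# Illustrative counterintuitive QA pairs in the spirit of the benchmark.
qa_pairs: List[Dict[str, str]] = [
    {
        "video": "feather_vs_rock.mp4",  # assumed file name
        "question": "Which object hits the ground first in the video?",
        "answer": "The feather; the video violates gravity.",
    },
    {
        "video": "glass_reassembles.mp4",  # assumed file name
        "question": "Does the shattered glass behave physically plausibly?",
        "answer": "No, the shards reassemble, which is physically impossible.",
    },
]

def evaluate(ask_vlm: Callable[[str, str], str],
             judge: Callable[[str, str], bool]) -> float:
    """Return accuracy of a video-language model on counterintuitive QA pairs."""
    correct = 0
    for item in qa_pairs:
        prediction = ask_vlm(item["video"], item["question"])
        correct += int(judge(prediction, item["answer"]))
    return correct / len(qa_pairs)

if __name__ == "__main__":
    # Stub model and judge for illustration; plug in a real VLM and a
    # semantic answer matcher (exact match is too strict for free-form answers).
    dummy_model = lambda video, question: "The rock hits the ground first."
    exact_judge = lambda pred, gold: pred.strip().lower() == gold.strip().lower()
    print(f"accuracy = {evaluate(dummy_model, exact_judge):.2f}")
```

A language-prior-driven model tends to answer such questions from commonsense expectations (e.g., "the rock falls faster") rather than from the observed, physics-violating footage, which is exactly the failure mode this setup probes.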
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 6899