WildVideo: Benchmarking LMMs for Understanding Video-Language Interaction

Published: 2025 · Last Modified: 21 Jan 2026 · IEEE Trans. Pattern Anal. Mach. Intell. 2025 · CC BY-SA 4.0
Abstract: We introduce WildVideo, an open-world benchmark designed to assess hallucination in Large Multimodal Models (LMMs) when understanding video-language interaction in the wild. WildVideo comprehensively tests perceptual, cognitive, and contextual-comprehension hallucination in LMMs through both single-turn and multi-turn open-ended question-answering (QA) tasks on videos captured from two human perspectives (i.e., first-person and third-person views). We define 9 distinct tasks that challenge LMMs across multi-level perceptual tasks (e.g., static and dynamic perception), multi-aspect cognitive tasks (e.g., commonsense and world knowledge), and multi-faceted contextual comprehension tasks (e.g., contextual ellipsis and cross-turn retrieval). The benchmark comprises 1,318 meticulously curated videos, supplemented with 13,704 single-turn QA pairs and 1,585 multi-turn dialogues (up to 5 turns each). We evaluate 14 commonly used LMMs on WildVideo, revealing significant hallucination issues in current LMMs and highlighting substantial gaps in their capabilities.