Abstract: We propose a novel task, hallucination localization in video captioning, which aims to identify hallucinations in video captions at the span level (i.e., individual words or phrases). This enables a more fine-grained analysis of hallucinations than the existing sentence-level hallucination detection task. We manually annotate 1,167 hallucination instances in VideoLLM-generated captions to build HLVC-Dataset, a specialized dataset for hallucination localization. We further implement a VideoLLM-based baseline method and conduct quantitative and qualitative evaluations to benchmark current performance on hallucination localization.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 2273