Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EGOILLUSION, the first benchmark for evaluating MLLM hallucinations in egocentric videos. EGOILLUSION comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions designed to trigger hallucinations from both visual and auditory cues in egocentric videos. Evaluations of ten MLLMs reveal significant challenges: even powerful models such as GPT-4o and Gemini achieve only 59% accuracy. EGOILLUSION lays the foundation for developing robust benchmarks that evaluate the effectiveness of MLLMs and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, video processing
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis, Position papers
Languages Studied: English
Submission Number: 6673