OmniHallu: Unified Hallucination Detection for Cross-Modal Comprehension and Generation in Multimodal Large Language Models

ACL ARR 2025 February Submission 7847 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: While recent Multimodal Large Language Models (MLLMs) have made impressive strides across diverse tasks and scenarios, they suffer from hallucination, where generated outputs contradict or misrepresent the input semantics. Existing research typically focuses on either comprehension or generation tasks within specific modalities, which limits the generalizability of hallucination studies in MLLMs. To bridge this gap, we introduce OmniHallu, a unified hallucination detection and evaluation framework for cross-modal comprehension and generation in MLLMs. We present a unified benchmark, OmniHallu-Bench, that evaluates both comprehension and generation across modalities, covering text-to-image (T2I), text-to-video (T2V), and text-to-audio (T2A) generation as well as image-to-text (I2T), video-to-text (V2T), and audio-to-text (A2T) comprehension. In addition, we propose a novel multi-agent hallucination detection architecture that automatically decomposes model outputs into claims and verifies each one, enabling structured hallucination assessment. Extensive evaluations and analyses demonstrate the effectiveness of our methods, establishing a robust foundation for hallucination detection in MLLMs. This work is a step toward more reliable and interpretable multimodal AI systems. We will release our source code and data with the camera-ready version.
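The decompose-and-verify idea the abstract describes can be summarized as: split a model's output into atomic claims, check each claim against the original input with a modality-appropriate verifier agent, and aggregate the results into a hallucination score. The sketch below is a minimal illustration of that pipeline; the function names, the stub agents, and the fraction-of-unsupported-claims scoring rule are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a decompose-then-verify hallucination pipeline.
# All names and the scoring rule are assumptions for illustration only;
# the paper's agents would query MLLMs and modality-specific checkers.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Claim:
    text: str               # one atomic claim extracted from the model output
    supported: bool = False # set by the verifier agent

def detect_hallucinations(
    output_text: str,
    decompose: Callable[[str], List[str]],  # agent that splits output into atomic claims
    verify: Callable[[str], bool],          # agent that checks one claim against the input
) -> float:
    """Return the fraction of claims NOT supported by the input."""
    claims = [Claim(c) for c in decompose(output_text)]
    for claim in claims:
        claim.supported = verify(claim.text)
    if not claims:
        return 0.0
    return sum(not c.supported for c in claims) / len(claims)

# Toy usage with stub agents standing in for real LLM-backed ones:
if __name__ == "__main__":
    decompose = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    verify = lambda claim: "red" not in claim  # stub: flag claims mentioning "red"
    rate = detect_hallucinations(
        "The cat is black. The cat wears a red hat.", decompose, verify
    )
    print(f"hallucination rate: {rate:.2f}")   # -> 0.50
```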
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal information extraction, cross-modal application
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 7847