No Evidence, No Problem: When Less is More for Out-of-Context Multimodal Misinformation Detection

ACL ARR 2025 July Submission 524 Authors

28 Jul 2025 (modified: 07 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: The proliferation of multimodal misinformation, particularly Out-of-Context (OOC) image-text mismatches, poses significant challenges for reliable information verification. Existing detection approaches often rely on unimodal signals, limiting their capacity to capture nuanced cross-modal inconsistencies. Although recent multimodal methods have improved performance, many depend on large-scale architectures or external web evidence, which hinders scalability and practical deployment. In this work, we introduce a lightweight and evidence-free framework for OOC misinformation detection that achieves competitive performance with high efficiency. Our approach enhances visual understanding by integrating semantic entity extraction and generated visual captions, which are fused with the accompanying textual caption and fed to a prompt-tuned Flan-T5 model. Simultaneously, a fine-tuned CLIP model evaluates image-text alignment. The outputs of both models are combined via a validation-optimized weighted ensemble. Extensive experiments on the NewsCLIPpings dataset demonstrate that our method achieves state-of-the-art accuracy among evidence-free techniques, while offering low computational overhead and strong interpretability, making it well-suited for real-world applications.
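The sketch below illustrates the kind of evidence-free pipeline the abstract describes: a CLIP image-text alignment score combined with a Flan-T5 verdict over a prompt that fuses a generated visual caption, extracted entities, and the news caption, merged by a weighted ensemble. All model checkpoints, the prompt wording, and the ensemble weight here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an evidence-free OOC detector: CLIP alignment + prompt-tuned
# Flan-T5, combined by a weighted ensemble. Model names, prompt, and weight are
# assumptions for illustration only.
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    T5ForConditionalGeneration, AutoTokenizer,
)

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t5_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")


def clip_alignment_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, rescaled to [0, 1]."""
    inputs = clip_processor(text=[caption], images=image,
                            return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = clip_model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img * txt).sum(dim=-1).item()      # cosine similarity in [-1, 1]
    return (sim + 1.0) / 2.0


def flan_t5_match_score(visual_caption: str, entities: str, caption: str) -> float:
    """P('yes') vs. P('no') for a hypothetical verification prompt that fuses the
    generated visual caption, extracted entities, and the textual caption."""
    prompt = (
        f"Image description: {visual_caption}\n"
        f"Detected entities: {entities}\n"
        f"News caption: {caption}\n"
        "Does the caption match the image? Answer yes or no."
    )
    inputs = t5_tokenizer(prompt, return_tensors="pt", truncation=True)
    yes_id = t5_tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = t5_tokenizer("no", add_special_tokens=False).input_ids[0]
    decoder_input = torch.tensor([[t5_model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = t5_model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()


def ensemble_predict(image, caption, visual_caption, entities, w: float = 0.6) -> str:
    """Weighted ensemble of the two scores; w would be tuned on a validation split."""
    score = (w * flan_t5_match_score(visual_caption, entities, caption)
             + (1 - w) * clip_alignment_score(image, caption))
    return "pristine" if score >= 0.5 else "out-of-context"
```

In this sketch the ensemble weight `w` plays the role of the validation-optimized weighting mentioned in the abstract; in practice it would be selected by a grid search over the validation split rather than fixed at 0.6.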
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, cross-modal pretraining, cross-modal application, cross-modal information extraction, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 524