Beyond Textual Claims: Strategy for Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations

ACL ARR 2025 July Submission333 Authors

27 Jul 2025 (modified: 30 Aug 2025) | ACL ARR 2025 July Submission | License: CC BY 4.0
Abstract: The growing prevalence of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we propose a unified framework for fine-grained multimodal fact verification, called "MultiCheck", designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships through element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84 and substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
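The fusion and alignment components described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the module names (FusionClassifier, contrastive_loss), the embedding dimension, the five-way label space, and the InfoNCE-style contrastive formulation are all assumptions for illustration, since only the abstract is available here.

```python
# Illustrative sketch of claim-evidence fusion via element-wise interactions
# plus a contrastive alignment objective. Names, dimensions, and the loss
# formulation are assumptions, not the authors' exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    def __init__(self, dim=768, num_classes=5):
        super().__init__()
        # Fuse concatenation, element-wise product, and absolute difference
        # of claim and evidence embeddings (a common element-wise scheme).
        self.head = nn.Sequential(
            nn.Linear(4 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, claim_emb, evidence_emb):
        fused = torch.cat(
            [claim_emb,
             evidence_emb,
             claim_emb * evidence_emb,
             torch.abs(claim_emb - evidence_emb)],
            dim=-1,
        )
        return self.head(fused)

def contrastive_loss(claim_emb, evidence_emb, temperature=0.07):
    """InfoNCE-style loss pulling matched claim-evidence pairs together
    in a shared latent space and pushing apart in-batch mismatches."""
    claim = F.normalize(claim_emb, dim=-1)
    evidence = F.normalize(evidence_emb, dim=-1)
    logits = claim @ evidence.t() / temperature
    targets = torch.arange(claim.size(0), device=claim.device)
    return F.cross_entropy(logits, targets)

# Usage with pre-computed text/image encoder outputs (batch of 8, dim 768):
claim_emb = torch.randn(8, 768)
evidence_emb = torch.randn(8, 768)
model = FusionClassifier()
logits = model(claim_emb, evidence_emb)                      # (8, num_classes)
loss = (F.cross_entropy(logits, torch.randint(0, 5, (8,)))
        + contrastive_loss(claim_emb, evidence_emb))
```

The combined objective mirrors the abstract's description: a supervised veracity classifier over the fused representation, regularized by a contrastive term that aligns claim and evidence embeddings.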
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: Multimodal, fact checking, contrastive learning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 333