Abstract: The widespread prevalence of misinformation poses significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods primarily evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks, requiring minimal training time. Additionally, we propose a lightweight model that achieves competitive performance using only one-third of the parameters. We also introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and limitations of our approach.
External IDs: dblp:conf/wacv/DelvecchioNE25