From Predictions to Analyses: Rationale-Augmented Fake News Detection with Large Vision-Language Models
Abstract: The rapid development of social media has led to a surge of eye-catching fake news on the Internet, with multimodal news comprising both images and text being particularly prevalent. To address the challenges of Multimodal Fake News Detection (MFND), numerous supervised, task-specific Multimodal Small Language Models (MSLMs) have been developed. However, these models lack breadth of knowledge and depth of language understanding, which results in unsatisfactory adaptability, generalization, and explainability. To address these issues, we introduce Large Vision-Language Models (LVLMs), aiming to leverage their commonsense understanding and logical reasoning abilities for the MFND task. We observe that LVLMs can generate reasonable analyses of news content from specific angles. However, when synthesizing these analyses into a final judgment, their performance declines significantly, falling short of the accuracy achieved by existing MSLM-based detectors. This reflects the need for a more effective way for LVLMs, which have not undergone task-specific training, to utilize their knowledge and capabilities. Based on these findings, we propose the Explainable Adaptive Rationale-Augmented Multimodal (EARAM) framework, which adaptively uses MSLMs to extract useful rationales from the multi-perspective analyses produced by LVLMs. After making judgments based on these rationales, EARAM then guides LVLMs to generate more reliable explanations. Extensive experiments demonstrate that our model not only achieves state-of-the-art results on widely used datasets but also significantly outperforms other models in terms of generalization and explainability.
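To make the described pipeline concrete, the following is a minimal Python sketch of a rationale-augmented detection loop as outlined in the abstract: an LVLM analyzes a news item from several perspectives, an MSLM adaptively filters those analyses into rationales and makes the judgment, and the LVLM then produces an explanation grounded in the retained rationales. The perspective list, function names, and threshold are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical analysis perspectives; the actual prompt design used by
# EARAM is not specified in the abstract.
PERSPECTIVES = [
    "image-text consistency",
    "textual plausibility",
    "visual manipulation cues",
]


@dataclass
class NewsItem:
    text: str
    image_path: str


def detect(
    item: NewsItem,
    lvlm_analyze: Callable[[NewsItem, str], str],      # LVLM: analysis from one perspective
    mslm_score: Callable[[NewsItem, str], float],       # MSLM: usefulness score for an analysis
    mslm_classify: Callable[[NewsItem, List[str]], str],  # MSLM: real/fake judgment from rationales
    lvlm_explain: Callable[[NewsItem, List[str], str], str],  # LVLM: explanation of the verdict
    keep_threshold: float = 0.5,                        # assumed cutoff for keeping a rationale
) -> Tuple[str, str]:
    # 1. The LVLM produces one free-form analysis per perspective.
    analyses = [lvlm_analyze(item, p) for p in PERSPECTIVES]

    # 2. The MSLM adaptively keeps only the analyses it judges useful as rationales.
    rationales = [a for a in analyses if mslm_score(item, a) >= keep_threshold]
    if not rationales:
        rationales = analyses  # fall back to all analyses if none pass the filter

    # 3. The judgment is made from the retained rationales rather than raw LVLM output.
    verdict = mslm_classify(item, rationales)

    # 4. The LVLM turns the verdict and rationales into a human-readable explanation.
    explanation = lvlm_explain(item, rationales, verdict)
    return verdict, explanation
```

The model calls are passed in as callables so the sketch stays self-contained; in practice they would wrap whichever LVLM and MSLM backbones the framework is instantiated with.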