HVLM: Hierarchical Visual-Language Models are Excellent Decision-makers for Multimodal Fake News Detection

ACL ARR 2025 February Submission8064 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Existing multimodal fake news detection methods based on traditional small models tend to learn superficial features and struggle to perform knowledge-based reasoning or to truly perceive fine-grained image-text consistency. Recently, fueled by large language models and multimodal pretraining techniques, large vision-language models (LVLMs) have made significant progress in these aspects, which motivates us to transfer them to multimodal fake news detection. Specifically, merely using a small LVLM (sLVLM), Qwen2-VL-2B, as the multimodal fusion module already significantly outperforms existing methods. However, we still find two weaknesses in it: 1) insufficient learning of low-level visual features; 2) difficulty in knowledge-based reasoning from a macro perspective. For the former, we employ an additional, smaller VLM, i.e., CLIP, as a visual-enhancement module to mitigate the sLVLM's weakness in visual perception. For the latter, multi-perspective prompts are used to elicit high-level rationales from a larger, un-tuned LVLM, Qwen2-VL-72B, which are then explicitly concatenated into the input of the sLVLM as supplementary features. The three-tier CLIP-sLVLM-LVLM framework forms our proposed Hierarchical Visual-Language Models (HVLM). Extensive experiments on three public datasets demonstrate the significant effectiveness and generalization ability of our proposed framework.
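To make the three-tier flow described in the abstract concrete, below is a minimal, hypothetical sketch of how the pieces could fit together: low-level visual features from a CLIP-style branch and rationale-augmented text embeddings from the sLVLM are fused by concatenation and passed to a classification head. All module names, dimensions, and the stand-in linear encoders are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the HVLM three-tier flow (CLIP -> sLVLM -> LVLM rationales).
# Dimensions and stand-in modules are assumptions for illustration only.
import torch
import torch.nn as nn


class HVLMSketch(nn.Module):
    def __init__(self, clip_dim=512, slvlm_dim=768, num_classes=2):
        super().__init__()
        # Stand-in for the CLIP visual-enhancement branch (low-level visual features).
        self.clip_visual = nn.Linear(3 * 224 * 224, clip_dim)
        # Stand-in for the tuned small LVLM (e.g., Qwen2-VL-2B) fusion backbone.
        self.slvlm_fusion = nn.Sequential(
            nn.Linear(slvlm_dim + clip_dim, slvlm_dim),
            nn.ReLU(),
        )
        # Binary real/fake classification head.
        self.classifier = nn.Linear(slvlm_dim, num_classes)

    def forward(self, image, text_plus_rationale_emb):
        # text_plus_rationale_emb: sLVLM embedding of [news text ; rationales
        # elicited from the un-tuned 72B LVLM via multi-perspective prompts].
        low_level_visual = self.clip_visual(image.flatten(1))
        fused = self.slvlm_fusion(
            torch.cat([text_plus_rationale_emb, low_level_visual], dim=-1)
        )
        return self.classifier(fused)


# Usage with random tensors standing in for a batch of image-text pairs.
model = HVLMSketch()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```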
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: rumor/misinformation detection, multimodal applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 8064