Multimodal Fake News Detection in Bilingual Media: A Vision–Language Fusion Approach Using Task-Specific CNNs and Multilingual Transformers

ACL ARR 2026 January Submission6170 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: BERT, CNN, XLM-RoBERTa, Fusion Classifier, Late Fusion, Bilingual, Fake News, Image-Text

Abstract: Misinformation on social media increasingly exploits visually persuasive thumbnail-style content, yet multimodal fake news detection remains underexplored for low-resource languages. This paper presents a bilingual multimodal framework for fake news detection that uses textual and visual information extracted from Bangla and English social media thumbnails collected from YouTube, Facebook, and Instagram. A manually annotated dataset of 19,890 thumbnail images with embedded textual content is constructed. Multilingual transformer models are fine-tuned on OCR-extracted text using a robust text preprocessing pipeline, while a custom convolutional neural network, the Visual Misinformation Detection Convolutional Neural Network (VMD-CNN), is designed for visual feature extraction. Unimodal predictions are combined using a late fusion strategy to enhance robustness. Experimental results show that XLM-RoBERTa-base outperforms the other multilingual transformer models on OCR-extracted text, achieving the highest macro F1-score of 97.20. VMD-CNN achieves 96.05% accuracy on visual content, confirming that textual cues dominate while visual features provide complementary signals. Late fusion of the best-performing unimodal models reaches 97.15% test accuracy, highlighting the effectiveness of decision-level fusion in integrating heterogeneous modalities. The framework provides a practical solution for detecting misleading social media content, particularly in bilingual low-resource settings, and offers a foundation for future research in multilingual multimodal fake news detection and automated content moderation.
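
The abstract describes decision-level (late) fusion of a text model and an image model but does not state the fusion rule. As a minimal sketch only, the following assumes a weighted average of per-class probabilities; the function name `late_fusion`, the `text_weight` parameter, and the example probabilities are hypothetical and are not taken from the paper.

```python
def late_fusion(text_probs, image_probs, text_weight=0.5):
    """Fuse two unimodal class-probability vectors by weighted averaging.

    Assumption: both models emit probabilities over the same ordered label
    set (for example, index 0 = real, index 1 = fake). The weighting scheme
    is illustrative, not the paper's documented method.
    """
    if len(text_probs) != len(image_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")

    image_weight = 1.0 - text_weight
    # Combine the decisions of the two models class by class.
    fused = [
        text_weight * t + image_weight * v
        for t, v in zip(text_probs, image_probs)
    ]
    # The predicted label is the class with the highest fused probability.
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label


# Hypothetical outputs for one thumbnail: the text model leans "fake",
# the image model leans "real"; equal weights let the text model prevail.
fused_probs, predicted = late_fusion([0.30, 0.70], [0.60, 0.40])
```

Because fusion happens on model outputs rather than on intermediate features, each unimodal model can be trained and tuned independently, and the weight can be chosen on a validation set.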

Paper Type: Long

Research Area: Multilinguality and Language Diversity

Research Area Keywords: AI, NLP, Multimodality, NLP Applications, Low Resources Methods for NLP, NLP for Social Good

Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: English and Bengali

Submission Number: 6170
