Abstract: Large pre-trained models (PLMs) have opened tremendous opportunities for multimodal fake news detection. However, existing multimodal fake news detection methods rarely exploit the token-wise hierarchical semantics of news yielded by PLMs, and they rely heavily on contrastive learning while ignoring the symmetry between text and image at the level of abstraction. This paper proposes a novel multimodal fake news detection method that balances the understanding of text and image by (1) designing a global-token cross-attention mechanism to capture the correlations between global text and token-wise image representations (or token-wise text and global image representations) obtained from BERT and ViT; (2) proposing a QK-sharing strategy within cross-attention to enforce model symmetry, which reduces information redundancy and accelerates fusion without sacrificing representational power; and (3) deploying a semantic augmentation module that systematically extracts token-wise multi-layered text semantics from stacked BERT blocks via CNN and Bi-LSTM layers, thereby rebalancing abstraction-level disparities by symmetrically enriching shallow and deep textual signals. We also demonstrate the effectiveness of our approach by comparing it with four state-of-the-art baselines on three widely adopted multimodal fake news datasets. The results show that our approach outperforms the benchmarks by 0.8% in accuracy and 2.2% in F1-score on average across the three datasets, demonstrating that symmetric, token-centric fusion of fine-grained semantics drives more robust fake news detection.
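As a minimal sketch of ideas (1) and (2), the PyTorch module below implements cross-attention in which queries and keys share a single projection. The class name QKSharedCrossAttention, the dimensions, and the pairing of a global [CLS] text token with ViT patch tokens are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKSharedCrossAttention(nn.Module):
    """Cross-attention with one projection shared by queries and keys.

    Sketch of the QK-sharing strategy: because Q and K use the same
    weights, the affinity between a text token and an image token is the
    same bilinear form in both fusion directions, with fewer parameters
    than separate Q/K projections. Names and sizes are assumptions.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qk_proj = nn.Linear(dim, dim)   # shared by queries AND keys
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # queries: (B, Nq, D), e.g. the global text token, Nq = 1
        # context: (B, Nc, D), e.g. token-wise ViT patch embeddings
        B, Nq, D = queries.shape
        Nc = context.shape[1]
        q = self.qk_proj(queries).view(B, Nq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.qk_proj(context).view(B, Nc, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(context).view(B, Nc, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Nq, D)
        return self.out_proj(out)

# Hypothetical global-token usage in both directions:
# text_cls   = bert_hidden[:, :1, :]   # global text token
# img_tokens = vit_hidden[:, 1:, :]    # token-wise image representations
# fused_text = QKSharedCrossAttention()(text_cls, img_tokens)
```

Sharing the Q/K weights roughly halves the projection parameters of a standard cross-attention block and makes the cross-modal affinity symmetric, which is consistent with the abstract's claims of reduced redundancy and faster fusion.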
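A sketch of the semantic augmentation module in (3) follows, assuming the CNN mixes the stacked per-layer BERT hidden states of each token and the Bi-LSTM then models the resulting token sequence; the kernel size, layer count, and use of output_hidden_states are assumptions for illustration, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class SemanticAugmentation(nn.Module):
    """Sketch: fuse token-wise semantics from all stacked BERT blocks with a
    CNN over the layer axis, then a Bi-LSTM over the token sequence, so that
    shallow and deep textual signals contribute symmetrically.
    """

    def __init__(self, num_layers: int = 12, hidden: int = 768):
        super().__init__()
        # 1x1 convolution learns a weighted mix of the L layer-wise views
        # of each token, feature by feature
        self.layer_conv = nn.Conv1d(num_layers, 1, kernel_size=1)
        self.bilstm = nn.LSTM(hidden, hidden // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, hidden_states):
        # hidden_states: tuple of (B, T, H) tensors from
        # BertModel(..., output_hidden_states=True); [1:] drops the
        # embedding output, keeping the 12 transformer block outputs
        stacked = torch.stack(hidden_states[1:], dim=2)        # (B, T, L, H)
        B, T, L, H = stacked.shape
        mixed = self.layer_conv(stacked.reshape(B * T, L, H))  # (B*T, 1, H)
        mixed = mixed.reshape(B, T, H)
        out, _ = self.bilstm(mixed)                            # (B, T, H)
        return out
```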
DOI: 10.3390/sym17060961