Abstract: Fake news is usually disseminated in multimodal form, combining natural language, visual content, and other modalities. Many deep learning approaches have therefore been proposed to detect multimodal fake news. However, existing methods simply fuse unimodal features and ignore the latent semantic alignment between the image and text modalities. In this paper, we propose a novel Multimodal Stacked Cross Attention Network (MSCA) that better aligns and fuses token-level textual and visual features for fake news detection. Experiments on two publicly available datasets show that our method significantly improves performance over other models. Further experimental analysis shows that MSCA effectively aligns and fuses token-level features across modalities.