LGCMNet: Multimodal Sentiment Analysis Network Based on Language-Guided Cross-Modal Interaction

Yao Wang, Minghua Nuo, Yuan Zhang, Xiaoyu Jia

Published: 2024, Last Modified: 24 Apr 2026, ICONIP (9) 2024, CC BY-SA 4.0

Abstract: Multimodal fusion is an active research topic in multimodal sentiment analysis (MSA), and recent fusion methods aim to achieve data complementarity through interaction between modalities. However, these methods struggle with complex sentiment expressions, are limited in the efficiency of their cross-modal interactions, and ignore differences in semantic information density across modalities. To address these problems, we propose the Language-Guided Cross-Modal Fusion Network (LGCMNet). First, LGCMNet enhances the audio and visual representations by extracting features at different levels with different pre-trained models. Second, we design two fusion layers: the Language-Guided Enhanced (LGE) layer and the Interactive Cross Attention (ICA) layer. The LGE layer handles interactions between the non-text modalities, while the ICA layer fuses text with the non-text modalities. Both layers use text information to guide inter-modal interactions, ensuring effective inter-modal information transfer and enhancement. Experiments show that LGCMNet achieves state-of-the-art (SOTA) performance compared with baseline models on the CMU-MOSI and CMU-MOSEI datasets.
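
The abstract does not give the internals of the LGE or ICA layers, so the following is only a minimal sketch of the general idea of text-guided cross attention, not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, and it assumes that text features act as queries over audio features; the function names `softmax` and `cross_attention` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention (assumption: no learned projections).

    queries: list of query vectors (here, hypothetical text features)
    keys, values: lists of vectors from another modality (here, audio)
    Returns one output vector per query, each a weighted mix of `values`.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this text query to every audio key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of audio values, guided by the text query.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy example (made-up numbers): 2 text tokens attend over 3 audio frames.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text, audio, audio)
```

In this sketch each row of `fused` is a text-conditioned summary of the audio sequence, which is one way text information could steer what is taken from a non-text modality. A real model would add learned query, key, and value projections, multiple heads, and residual connections.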