Abstract: Multimodal sentiment analysis aims to predict the sentiment of text or video with the help of other modalities, such as audio and vision. Most previous studies focus on learning a joint representation of the modalities after encoding them with RNN-based networks, and ignore the harm that noisy modal information does to multimodal fusion. Moreover, multimodal information cannot be fused effectively when the modalities are temporally unaligned. To address these limitations, in this paper we propose a Noise Filtering and Cross-Modal Fusion network (NFCMF) to better fuse multiple modalities. We conduct experiments on the CMU-MOSI, CMU-MOSEI, and YouTube datasets. The competitive experimental results and qualitative analysis demonstrate the effectiveness of our model.
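The abstract does not detail the architecture, but the two ideas it names, filtering noisy modal features and fusing unaligned modalities, can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' NFCMF: it pairs a hypothetical sigmoid gate (standing in for noise filtering) with standard cross-modal attention, where text features attend over acoustic features of a different sequence length, so no temporal alignment is required.

```python
import torch
import torch.nn as nn

class GatedNoiseFilter(nn.Module):
    """Illustrative soft filter: a sigmoid gate down-weights noisy feature dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        return x * torch.sigmoid(self.gate(x))  # element-wise soft filtering

class CrossModalAttention(nn.Module):
    """Text queries attend over an auxiliary modality (audio or vision);
    attention tolerates sequences of different, unaligned lengths."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text, aux):               # text: (B, Lt, D), aux: (B, La, D)
        fused, _ = self.attn(query=text, key=aux, value=aux)
        return fused + text                     # residual connection back to text

# Toy usage: random tensors stand in for encoder outputs (hypothetical shapes).
text  = torch.randn(2, 20, 64)   # textual features
audio = torch.randn(2, 50, 64)   # acoustic features with a different length (unaligned)
filtered_audio = GatedNoiseFilter(64)(audio)
fused = CrossModalAttention(64)(text, filtered_audio)
print(fused.shape)               # torch.Size([2, 20, 64])
```

Gating before fusion is one common way to suppress unreliable modal signals; the paper's actual filtering and fusion mechanisms may differ.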
External IDs: dblp:conf/ialp/SuHLLY21