Multimodal sentiment analysis (MSA) aims to integrate multiple modalities of information to better understand human sentiment. Current research focuses mainly on multimodal fusion and representation learning, and neglects the under-optimized modal representations caused by imbalanced unimodal performance in joint learning. Moreover, the limited size of labeled datasets restricts the generalization ability of existing supervised MSA models. To address these issues, this paper proposes a knowledge-enhanced self-supervised balanced representation approach (KEBR) that captures common sentiment knowledge in unlabeled videos and addresses the optimization issue of information imbalance between modalities. First, a text-based cross-modal fusion method (TCMF) is constructed, which injects non-verbal information from the videos into the semantic representation of text to enhance the multimodal representation of text. Then, a multimodal cosine constrained loss (MCC) is designed to constrain the fusion of non-verbal information during joint learning and balance the representation of multimodal information. Finally, with the help of sentiment knowledge and non-verbal information, KEBR performs sentiment word masking and sentiment intensity prediction, so that the sentiment knowledge in the videos is embedded into the pre-trained multimodal representation in a balanced manner. Experimental results on two publicly available datasets, MOSI and MOSEI, show that KEBR significantly outperforms the baselines, achieving new state-of-the-art results.
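The abstract describes the multimodal cosine constrained loss (MCC) only at a high level. As a rough, non-authoritative illustration of how such a constraint could be implemented, the sketch below adds a cosine-alignment penalty between a fused multimodal embedding and the text embedding to a task loss; the function name, the `lambda_mcc` weight, and the exact form of the penalty are assumptions for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def cosine_constrained_loss(fused_repr: torch.Tensor,
                            text_repr: torch.Tensor,
                            task_loss: torch.Tensor,
                            lambda_mcc: float = 0.1) -> torch.Tensor:
    """Hypothetical cosine constraint: keep the fused representation
    aligned with the text representation so injected non-verbal
    information does not dominate joint learning."""
    # Per-sample cosine similarity between fused and text embeddings
    cos_sim = F.cosine_similarity(fused_repr, text_repr, dim=-1)
    # Penalize deviation from the text anchor (1 - cos_sim lies in [0, 2])
    constraint = (1.0 - cos_sim).mean()
    # Combine with the main prediction loss via an assumed weight
    return task_loss + lambda_mcc * constraint
```

Under this reading, the weight on the constraint term would trade off how strongly the fused representation is anchored to the text modality during joint training.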