Abstract: Multimodal sentiment analysis (MSA) aims to integrate information from multiple modalities to better understand human sentiment. Current research focuses mainly on multimodal fusion and representation learning, and neglects the under-optimized modal representations caused by the imbalance of unimodal performance in joint learning. Moreover, the limited size of labeled datasets restricts the generalization ability of existing supervised MSA models. To address these issues, this paper proposes a knowledge-enhanced self-supervised balanced representation approach (KEBR) that captures common sentiment knowledge from unlabeled videos and tackles the optimization problem of information imbalance between modalities. First, a text-based cross-modal fusion method (TCMF) is constructed, which injects non-verbal information from the videos into the semantic representation of text to enhance the multimodal representation of text. Then, a multimodal cosine constrained loss (MCC) is designed to constrain the fusion of non-verbal information during joint learning and balance the representation of multimodal information. Finally, with the help of sentiment knowledge and non-verbal information, KEBR performs sentiment word masking and sentiment intensity prediction, so that the sentiment knowledge in the videos is embedded into the pre-trained multimodal representation in a balanced manner. Experimental results on two publicly available datasets, MOSI and MOSEI, show that KEBR significantly outperforms the baselines and achieves new state-of-the-art results.
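To make the text-based fusion idea concrete, the sketch below shows one plausible way to inject non-verbal (audio/visual) features into a text representation with text as the query of a cross-attention block. All module names, feature dimensions, and the use of standard multi-head attention are illustrative assumptions, not the paper's exact TCMF design.

```python
# Minimal sketch of text-based cross-modal fusion: text queries attend over
# projected audio/visual features, and a residual connection keeps text dominant.
# Dimensions and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn

class TextBasedFusion(nn.Module):
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_model=768, n_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, visual):
        # text: (B, Lt, d_text), audio: (B, La, d_audio), visual: (B, Lv, d_visual)
        q = self.text_proj(text)
        kv = torch.cat([self.audio_proj(audio), self.visual_proj(visual)], dim=1)
        fused, _ = self.cross_attn(q, kv, kv)   # non-verbal information attended by text
        return self.norm(q + fused)             # residual: text remains the backbone
```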
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This paper proposes a knowledge-enhanced self-supervised balanced representation approach for multimodal sentiment analysis. The main contributions of this work to multimedia/multimodal processing are as follows:
1) This paper proposes a self-supervised learning method for multimodal sentiment analysis, which uses sentiment knowledge from large-scale unlabeled multimedia videos to facilitate sentiment representation learning. This sentiment-enhanced multimodal pre-training method effectively reduces the dependence of existing supervised models on labeled datasets in multimodal sentiment analysis tasks.
2) This paper proposes a text-based cross-modal fusion approach, highlighting the dominant role of text and the supplementary role of non-verbal modalities in multimodal sentiment analysis tasks. The approach can be readily applied to other multimodal tasks, yielding fusion schemes in which the core modality is primary and the remaining modalities are auxiliary.
3) This paper proposes a multimodal cosine constrained loss function (MCC) to mitigate the imbalance of unimodal representations in joint learning. MCC is designed as an external constraint that adds almost no training cost and is independent of the model architecture. It can therefore be applied to different multimodal tasks to mitigate the problem that some modalities are neglected during multimodal fusion due to modality gaps (a hedged sketch of such a constraint follows this list).
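The following sketch illustrates one possible cosine-based balance constraint in the spirit of MCC: it penalizes both low similarity between the fused representation and each non-verbal modality and the gap between the two similarities. The exact MCC formulation in the paper may differ; this is only an assumed variant for illustration.

```python
# Illustrative cosine-based balance constraint (not the paper's exact MCC).
# Encourages the fused representation to stay close to both non-verbal
# modalities and to treat them comparably, so neither is neglected.
import torch
import torch.nn.functional as F

def cosine_balance_loss(fused, audio_repr, visual_repr):
    # fused, audio_repr, visual_repr: (B, d) pooled utterance-level representations
    sim_a = F.cosine_similarity(fused, audio_repr, dim=-1)   # (B,)
    sim_v = F.cosine_similarity(fused, visual_repr, dim=-1)  # (B,)
    # pull both modalities toward the fused space and penalize their imbalance
    return ((1 - sim_a) + (1 - sim_v) + (sim_a - sim_v).abs()).mean()

# Usage (assumed): total_loss = task_loss + lambda_mcc * cosine_balance_loss(h_fused, h_audio, h_visual)
```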
Supplementary Material: zip
Submission Number: 2834