Abstract: Multimodal sentiment analysis has attracted increasing attention. Most existing work focuses on designing networks that align and then fuse representations from the individual modalities. Contrastive learning, valued for its intrinsic alignment capability, has also been widely applied to multimodal sentiment analysis. However, current contrastive learning methods are typically limited to pairwise modalities and are applied before modality fusion, neglecting the consistency of interactions across multiple modalities as well as the overall consistency within samples. To address these issues, we introduce a novel Multi-Level Contrastive Learning (MLCL) framework for multimodal sentiment analysis, composed of Uni-Modal Contrastive Learning (UMCL), Bi-Modal Contrastive Learning (BMCL), and Tri-Modal Contrastive Learning (TMCL). UMCL enhances intra-modal representations by constructing positive pairs through modality-specific random dropout; BMCL exploits the asymmetry of attention mechanisms, treating the two directional attention outputs as positive samples; and TMCL aligns non-overlapping uni-modal and bi-modal representations, underscoring the complementarity of tri-modal information. Comprehensive experiments on multiple datasets demonstrate the superiority of MLCL, which achieves new state-of-the-art performance.
DOI: 10.1109/tmm.2025.3613116
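The abstract only sketches UMCL at a high level, so the following is a minimal illustrative sketch of the general idea it describes: two stochastic dropout passes over the same uni-modal input yield a positive pair, trained with an InfoNCE-style contrastive loss with in-batch negatives. The module names, dimensions, dropout rate, and temperature are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniModalEncoder(nn.Module):
    """Toy uni-modal encoder (hypothetical). In training mode, dropout stays
    active, so two forward passes over the same batch produce two different
    'views' of each sample, which serve as a positive pair."""
    def __init__(self, in_dim: int, hid_dim: int = 128, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim),
            nn.ReLU(),
            nn.Dropout(p),  # modality-specific random dropout (rate assumed)
            nn.Linear(hid_dim, hid_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: matching rows of z1/z2 are positives; every other row in
    the batch acts as an in-batch negative. Temperature tau is an assumption."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Usage: two dropout-perturbed passes over the same (dummy) text features
# act as a positive pair for the text modality.
encoder = UniModalEncoder(in_dim=300)
x_text = torch.randn(32, 300)                       # dummy batch of features
loss = info_nce(encoder(x_text), encoder(x_text))
loss.backward()
```

The same loss would plausibly apply at the other levels described above, with the positive pair built differently: at the bi-modal level from the two directional cross-attention outputs, and at the tri-modal level from non-overlapping uni-modal and bi-modal representations.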