Abstract: Multimodal sentiment analysis has attracted increasing attention. Most existing work focuses on designing networks that align and then fuse representations from the individual modalities. Contrastive learning, valued for its intrinsic alignment capability, has also been widely applied to multimodal sentiment analysis. However, current contrastive learning methods are typically limited to pairwise modalities and are applied before modality fusion, neglecting the consistency of interactions across multiple modalities as well as the overall consistency within samples. To address these issues, we introduce a novel Multi-Level Contrastive Learning (MLCL) framework for multimodal sentiment analysis, composed of Uni-Modal Contrastive Learning (UMCL), Bi-Modal Contrastive Learning (BMCL), and Tri-Modal Contrastive Learning (TMCL). UMCL enhances intra-modal representations by constructing positive pairs through modality-specific random dropout; BMCL exploits the asymmetry of attention mechanisms, treating the two directional attention outputs as positive samples; and TMCL aligns non-overlapping uni-modal and bi-modal representations, underscoring the complementarity of tri-modal information. Comprehensive experiments on multiple datasets demonstrate the superiority of MLCL, which achieves new state-of-the-art performance.
DOI: 10.1109/tmm.2025.3613116
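The abstract only sketches UMCL at a high level, so the following is a minimal illustrative sketch of the general idea it describes: two stochastic dropout passes over the same uni-modal input yield a positive pair, trained with an InfoNCE-style contrastive loss with in-batch negatives. The module names, dimensions, dropout rate, and temperature are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniModalEncoder(nn.Module):
    """Toy uni-modal encoder (hypothetical). In training mode, dropout stays
    active, so two forward passes over the same batch produce two different
    'views' of each sample, which serve as a positive pair."""
    def __init__(self, in_dim: int, hid_dim: int = 128, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim),
            nn.ReLU(),
            nn.Dropout(p),  # modality-specific random dropout (rate assumed)
            nn.Linear(hid_dim, hid_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE loss: matching rows of z1/z2 are positives; every other row in
    the batch acts as an in-batch negative. Temperature tau is an assumption."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Usage: two dropout-perturbed passes over the same (dummy) text features
# act as a positive pair for the text modality.
encoder = UniModalEncoder(in_dim=300)
x_text = torch.randn(32, 300)                       # dummy batch of features
loss = info_nce(encoder(x_text), encoder(x_text))
loss.backward()
```

The same loss would plausibly apply at the other levels described above, with the positive pair built differently: at the bi-modal level from the two directional cross-attention outputs, and at the tri-modal level from non-overlapping uni-modal and bi-modal representations.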