SLaD: Sub-modal Label-aware Disentanglement for Multimodal Sentiment Analysis

ICLR 2026 Conference Submission 16466 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal sentiment analysis, sub-modal label-aware, contrastive learning, modality-invariant representation, multi-label supervision.
TL;DR: The paper proposes SLaD, a multimodal sentiment analysis framework that disentangles shared and modality-specific features via label-aware weighting and specialized losses, achieving state-of-the-art results.
Abstract: Multimodal sentiment analysis (MSA) requires integrating heterogeneous information effectively while addressing inconsistent emotional cues across modalities. However, existing approaches often fail to disentangle modality-invariant and modality-specific representations, leading to suboptimal feature alignment and semantic entanglement, especially when emotional expressions differ across sub-modalities. To address this issue, we propose a Sub-modal Label-aware Disentanglement (SLaD) framework that enhances cross-modal representation learning through a sub-modal label similarity weighting mechanism. Specifically, SLaD defines three structural relationships among sub-modal labels and introduces a hybrid similarity function that integrates structural consistency with numerical similarity. This approach mitigates label noise and conflicts arising from heterogeneous modality information. We further introduce three complementary losses for joint optimization: (1) a modality contrastive loss that aligns modality-invariant features, (2) a modality repulsive loss that enhances the discriminability of modality-specific features, and (3) a multi-label contrastive loss that captures correlations among sub-modal emotional labels. Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS show that SLaD achieves state-of-the-art performance on both classification and regression tasks, demonstrating the effectiveness of sub-modal label-aware supervision and disentanglement for advancing multimodal sentiment understanding.
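The abstract's first two losses can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the InfoNCE-style form of the contrastive term, and the hinged-cosine form of the repulsive term are illustrative assumptions; the paper's actual losses additionally incorporate the sub-modal label similarity weights, which are omitted here.

```python
import math

def _cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def modality_contrastive_loss(shared_a, shared_b, tau=0.1):
    """InfoNCE-style alignment of modality-invariant features (illustrative).

    The i-th sample's shared embedding from modality A is treated as a
    positive for the i-th embedding from modality B; all other pairs in
    the batch act as negatives.
    """
    n = len(shared_a)
    loss = 0.0
    for i in range(n):
        sims = [math.exp(_cosine(shared_a[i], shared_b[j]) / tau)
                for j in range(n)]
        loss += -math.log(sims[i] / sum(sims))
    return loss / n

def modality_repulsive_loss(specific_a, specific_b):
    """Push modality-specific features of the same sample apart (illustrative).

    Penalizes positive cosine similarity between the two modalities'
    specific embeddings, encouraging disentangled, discriminable features.
    """
    n = len(specific_a)
    return sum(max(0.0, _cosine(specific_a[i], specific_b[i]))
               for i in range(n)) / n
```

As a sanity check, the contrastive loss should be lower when the two modalities' shared embeddings are aligned index-by-index than when they are shuffled, and the repulsive loss should vanish for orthogonal modality-specific features.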
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16466