SDRS: Sentiment-Aware Disentangled Representation Shifting for Multimodal Sentiment Analysis

Published: 2025, Last Modified: 10 Feb 2026; IEEE Trans. Affect. Comput. 2025; CC BY-SA 4.0
Abstract: Multimodal sentiment analysis (MSA) aims to leverage complementary information from multiple modalities for affective understanding of user-generated videos. Existing methods mainly focus on designing sophisticated feature fusion strategies to integrate separately extracted multimodal representations, while ignoring interference from information that is irrelevant to sentiment. In this paper, we propose to disentangle the unimodal representations into sentiment-specific and sentiment-independent features, and to fuse only the sentiment-specific features for the MSA task. Specifically, we design a novel Sentiment-aware Disentangled Representation Shifting framework, termed SDRS, with two components. The first, interactive sentiment-aware representation disentanglement, extracts sentiment-specific feature representations for each nonverbal modality by considering the contextual influence of the other modalities through a newly developed cross-attention autoencoder. The second, attentive cross-modal representation shifting, shifts the textual representation in a latent token space using the projected nonverbal sentiment-specific representations. The shifted representation is finally used to fine-tune a pre-trained language model for multimodal sentiment analysis. Extensive experiments are conducted on three public benchmark datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. The results demonstrate that the proposed SDRS framework not only obtains state-of-the-art results using multimodal labels alone but also outperforms methods that additionally require labels for each modality.
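The idea of shifting text tokens with nonverbal features can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the shifting rule, so the additive update, the fixed weight `alpha`, and the function names `cross_attend` and `shift_text` are all assumptions made for illustration. It also assumes the nonverbal features are already projected to the text dimension, and it omits learned projections and the disentangling autoencoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query attends over keys_values,
    which serve as both keys and values (no learned projections here)."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def shift_text(text_tokens, nonverbal_feats, alpha=0.5):
    """Hypothetical shift rule (an assumption, not from the paper):
    add alpha times the attended nonverbal context to each text token."""
    context = cross_attend(text_tokens, nonverbal_feats)
    return [
        [t + alpha * c for t, c in zip(token, ctx)]
        for token, ctx in zip(text_tokens, context)
    ]


# Two text tokens and three nonverbal frames, all in a 2-dimensional space.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
shifted = shift_text(text, audio, alpha=0.5)
```

In this sketch each text token keeps its own content and is displaced by a summary of the nonverbal frames it attends to most strongly; a trained model would learn the projections and the strength of the shift rather than fixing them.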