DDSE: A Decoupled Dual-Stream Enhanced Framework for Multimodal Sentiment Analysis with Text-Centric SSM

Shenjie Jiang, Zhuoyu Wang, Xuecheng Wu, Hongru Ji, Mingxin Li, Xianghua Li, Chao Gao

Published: 27 Oct 2025. Last Modified: 21 Nov 2025. License: CC BY-SA 4.0.
Abstract: Multimodal Sentiment Analysis (MSA) aims to identify sentiment polarity and intensity in multimedia content. Current methods typically employ a two-stage pipeline: extracting features from each modality, then predicting sentiment from fused representations. However, most fusion strategies align features from different modalities in a single step, which leads to conflicts during cross-modal interaction and hinders the modeling of hierarchical sentiment dependencies. In addition, existing methods often overlook the dominant role of the textual modality in the high-level latent fusion space, allowing explicit linguistic sentiment cues to be obscured by redundant information. To address these issues, this work proposes DDSE (a Decoupled Dual-Stream Enhanced framework), which decouples features into public and private representations for improved feature enhancement and cross-modal interaction. The proposed TC-Mamba module enables progressive cross-modal interactions within shared state transition matrices under a text-guided fusion paradigm, effectively preserving sentiment cues and minimizing redundancy. DDSE further adopts a multi-task learning strategy to enhance overall performance. Extensive experiments on the MOSI and MOSEI datasets demonstrate that DDSE achieves state-of-the-art results, with Acc-5 improvements of 3.06% and 0.1%, respectively, underscoring its effectiveness for MSA. Ablation studies confirm the critical contribution of each component of the framework. Code is available at https://anonymous.4open.science/r/DDSE-76D6.
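The abstract's core idea of decoupling each modality into "public" (shared-subspace) and "private" (modality-specific) representations, with text gating the fusion, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the shared/private linear projections, and the sigmoid text gate are all assumptions chosen for clarity, and the actual TC-Mamba module uses state-space (Mamba-style) dynamics rather than a single gate.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    # plain affine projection standing in for a learned encoder
    return x @ w + b

d_in, d_sub = 8, 4  # hypothetical input and subspace dimensions
# hypothetical per-modality feature vectors (batch of 1)
feats = {m: rng.standard_normal((1, d_in)) for m in ("text", "audio", "vision")}

# one "public" encoder shared across all modalities ...
w_pub, b_pub = rng.standard_normal((d_in, d_sub)), np.zeros(d_sub)
# ... and one "private" encoder per modality
priv = {m: (rng.standard_normal((d_in, d_sub)), np.zeros(d_sub)) for m in feats}

public = {m: linear(x, w_pub, b_pub) for m, x in feats.items()}
private = {m: linear(x, *priv[m]) for m, x in feats.items()}

# text-guided fusion (assumed form): a sigmoid gate computed from the
# text stream scales the non-text public features before merging
gate = 1.0 / (1.0 + np.exp(-public["text"]))
fused_public = public["text"] + gate * (public["audio"] + public["vision"])

# final representation: fused public stream plus each private stream
fused = np.concatenate([fused_public] + [private[m] for m in feats], axis=-1)
print(fused.shape)  # (1, 16): one 4-d public block + three 4-d private blocks
```

In this toy setup the shared projection plays the role of the common subspace where cross-modal interaction happens, while the private projections preserve modality-specific cues that one-step fusion would otherwise wash out.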