MDMU: Multimodal Dynamic Mamba UNet for Multimodal Sentiment Analysis

Published: 2025, Last Modified: 06 Mar 2026, Venue: ICME 2025, License: CC BY-SA 4.0
Abstract: In multimodal sentiment analysis, linguistic, visual, and audio sequence data are utilized to assess users’ emotional intensity. However, due to discrepancies in sampling rates across different modalities, sequence alignment poses a significant challenge. While cross-attention-based methods can effectively address this issue, the pairwise attention computation across three modalities incurs substantial computational overhead. To address these challenges, we propose the Multimodal Dynamic Mamba UNet (MDMU) framework, which represents the first integration of a UNet-like structure with Mamba for multimodal sequence modeling. The UNet architecture is employed to capture temporal interactions across modalities, while the Mamba module is utilized for semantic feature modeling, ensuring computational efficiency with near-linear complexity. Additionally, we introduce the Multimodal Momentum Contrast (MMC) method, which also eschews pairwise interactions between the three modalities in favor of a unified approach. MMC facilitates fine-grained fusion between modalities by dynamically constructing a large set of hard negative samples to enhance intermodal interactions. Experimental results demonstrate that even non-strictly aligned temporal interactions benefit the model, offering a novel perspective for multimodal sequence modeling. Our code is available at https://github.com/SCNU-RISLAB/MDMU.
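The abstract does not spell out how MMC is built, so the following is only a minimal sketch of the generic momentum-contrast recipe that the name suggests: a slowly updated key encoder, a fixed-size queue of negatives, and an InfoNCE loss. It is not the authors' implementation. All function names, the toy 2-D embeddings, the queue size, the momentum value, and the temperature are assumptions chosen for illustration.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query, one positive key, and a set of negative keys."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negative_keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]  # negative log-softmax of the positive


# Fixed-size FIFO queue of negative keys (hypothetical size of 4).
queue = deque(maxlen=4)
for key in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.7, 0.7]):
    queue.append(key)  # once full, the oldest key is dropped automatically

# A query that agrees with its positive key yields a lower loss than one that does not.
loss_aligned = info_nce([1.0, 0.1], [1.0, 0.0], list(queue))
loss_misaligned = info_nce([1.0, 0.1], [-1.0, 0.0], list(queue))

# Toy momentum step on two scalar "parameters" (hypothetical values).
updated = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
```

In this sketch a single queue supplies negatives for every query, which is one plausible way to avoid computing a separate contrastive term for each pair of modalities; whether MMC does this, and how it selects hard negatives, should be checked against the released code.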