MFMamba: A Multimodal Fusion State Space Model for Depression Recognition

Published: 2025 · Last Modified: 07 Nov 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Depression is a severe mental illness, and extracting emotional information from video-audio signals for multimodal depression recognition is a challenging problem. Recent methods use the self-attention (SA) mechanism from Transformers to capture dynamic relationships between modalities. However, the quadratic computational complexity of SA limits its effectiveness on long sequences, making it insufficient for capturing complex intra-modal and inter-modal complementarity. To address this issue, this work proposes a Multimodal Fusion Mamba (MFMamba) framework, an attention-free approach that relies purely on state space models (SSMs) for long-sequence modeling. Specifically, we devise a Video Spatio-Temporal Mamba (VSTMamba) and an Audio Temporal Mamba (ATMamba) for video and audio feature extraction. To fully capture the correlations among multimodal features and eliminate information redundancy, we introduce a Fusion Mamba (FMamba) that integrates the various features effectively. In experiments on the AVEC 2013 and AVEC 2014 datasets, our method achieves competitive results.
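The paper's implementation is not reproduced here, but the pipeline the abstract describes (two modality-specific Mamba branches feeding a fusion Mamba) can be sketched in PyTorch. In the sketch below, a simplified diagonal linear SSM stands in for Mamba's selective scan, and every module name (SimpleSSM, MFMambaSketch), feature dimension, and the mean-pooled regression head are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch only: a diagonal linear SSM stands in for Mamba's selective
# scan; all module names and dimensions are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Diagonal linear state space layer: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    Real Mamba uses input-dependent (selective) parameters and a
    hardware-aware parallel scan; this loop is for clarity only.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.rand(d_state) * -0.5)  # negative -> stable decay
        self.B = nn.Linear(d_model, d_state, bias=False)
        self.C = nn.Linear(d_state, d_model, bias=False)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        bsz, seq_len, _ = x.shape
        decay = torch.exp(self.A)                      # per-state decay in (0, 1]
        u = self.B(x)                                  # (batch, time, d_state)
        h = torch.zeros(bsz, decay.numel(), device=x.device)
        ys = []
        for t in range(seq_len):                       # sequential scan
            h = decay * h + u[:, t]
            ys.append(self.C(h))
        return self.norm(x + torch.stack(ys, dim=1))   # residual + norm


class MFMambaSketch(nn.Module):
    """Hypothetical wiring of the three branches named in the abstract."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.video_proj = nn.Linear(512, d_model)  # assumed per-frame feature size
        self.audio_proj = nn.Linear(40, d_model)   # assumed per-step acoustic size
        self.vst_mamba = SimpleSSM(d_model)        # stand-in for VSTMamba
        self.at_mamba = SimpleSSM(d_model)         # stand-in for ATMamba
        self.f_mamba = SimpleSSM(d_model)          # stand-in for FMamba
        self.head = nn.Linear(d_model, 1)          # depression-score regression

    def forward(self, video_feats, audio_feats):
        v = self.vst_mamba(self.video_proj(video_feats))  # (batch, Tv, d)
        a = self.at_mamba(self.audio_proj(audio_feats))   # (batch, Ta, d)
        fused = self.f_mamba(torch.cat([v, a], dim=1))    # concat along time
        return self.head(fused.mean(dim=1))               # pooled prediction


model = MFMambaSketch()
score = model(torch.randn(2, 100, 512), torch.randn(2, 300, 40))
print(score.shape)  # torch.Size([2, 1])
```

Concatenating the two token sequences before the fusion block is one plausible reading of "integrate various features"; the actual FMamba fusion strategy is described in the paper itself.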