Abstract: Automatic multi-modal depression recognition using artificial intelligence technology is crucial for advancing early diagnosis and treatment. Existing methods perform weakly in detecting depression due to incomplete unimodal semantic information and insufficient fusion. To address these challenges, we propose a novel Semantic-Enhanced Dual Cross-modal Fusion Network (SE-DCFN) for multi-modal depression recognition, specifically designed for text-audio data. First, we utilize a prompt learning-based text encoder and a language-audio pretraining-based audio encoder to capture modality-specific information and enhance the semantic representations. Then, we introduce a dual cross-modal fusion module based on self-attention and cross-attention mechanisms to effectively exploit linguistic and acoustic representations, facilitating inter-modal and intra-modal interaction and fusion. Additionally, a triplet contrastive loss is formulated to optimize the training process of the SE-DCFN. Experimental results on the EATD-Corpus and AVEC-2017 datasets demonstrate the effectiveness and superiority of our proposed SE-DCFN in multi-modal depression recognition, outperforming existing methods.
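The abstract does not give implementation details, but the dual cross-modal fusion module can be illustrated with a minimal sketch: intra-modal self-attention on each modality, followed by inter-modal cross-attention in both directions. The sketch below assumes PyTorch; the class name `DualCrossModalFusion`, the hidden dimension, the number of heads, and the mean-pooling/concatenation step are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of a dual cross-modal fusion block (not the authors' implementation).
import torch
import torch.nn as nn


class DualCrossModalFusion(nn.Module):
    """Intra-modal self-attention followed by inter-modal cross-attention
    for text and audio token sequences (illustrative only)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Intra-modal refinement of each modality.
        self.text_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-modal interaction: each modality attends to the other.
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # text: (B, Lt, dim), audio: (B, La, dim)
        text, _ = self.text_self_attn(text, text, text)
        audio, _ = self.audio_self_attn(audio, audio, audio)
        # Cross-attention: text queries attend to audio keys/values and vice versa.
        t2a, _ = self.text_to_audio(text, audio, audio)
        a2t, _ = self.audio_to_text(audio, text, text)
        # Pool and concatenate the fused representations for the classifier head.
        fused = torch.cat(
            [self.norm(t2a).mean(dim=1), self.norm(a2t).mean(dim=1)], dim=-1
        )
        return fused  # (B, 2 * dim)


if __name__ == "__main__":
    fusion = DualCrossModalFusion(dim=256, num_heads=4)
    text_feats = torch.randn(8, 20, 256)   # e.g. output of a prompt-based text encoder
    audio_feats = torch.randn(8, 50, 256)  # e.g. output of a language-audio pretrained encoder
    print(fusion(text_feats, audio_feats).shape)  # torch.Size([8, 512])
```

For the training objective, a standard triplet margin loss (e.g. `torch.nn.TripletMarginLoss`) could serve as a stand-in for the triplet contrastive loss, whose exact formulation is not specified in the abstract.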
External IDs: dblp:conf/bibm/HuY0HCM24