Parallel Multiscale Bridge Fusion Network for Audio-Visual Automatic Depression Assessment

Published: 01 Jan 2024, Last Modified: 20 Feb 2025. IEEE Trans. Comput. Soc. Syst. 2024. License: CC BY-SA 4.0
Abstract: Depression is a prevalent and severe mental illness that significantly impacts patients’ physical health and daily life. Recent studies have focused on multimodal depression assessment, aiming to evaluate depression objectively and conveniently from multimodal data. However, existing methods based on audio–visual modalities struggle to capture the dynamic variations in depression cues and cannot fully exploit multimodal data over long time spans. In addition, they rely heavily on insufficient single-stage multimodal fusion, which limits assessment accuracy. To address these limitations, we propose a novel parallel multiscale bridge fusion network (PMBFN) for audio–visual depression assessment. PMBFN comprehensively captures subtle multilevel dynamic changes in depression expression through parallel multiscale dynamic convolutions and long short-term memory (LSTM) networks, and it mitigates information loss in long audio–visual sequences by using spatiotemporal attention pooling modules. Furthermore, PMBFN introduces a multimodal bridge fusion module that performs multistage interactive recursive fusion, enhancing the expressive capacity of multimodal depression-related features and thereby improving assessment accuracy. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate that our method outperforms current state-of-the-art methods, confirming its effectiveness.
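The pipeline the abstract describes — parallel multiscale temporal branches per modality, attention pooling over time, then multistage recursive fusion of the audio and visual streams — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the averaging convolution, the scalar attention scores, the mixing coefficient `alpha`, and the number of fusion stages are all simplifying assumptions chosen for clarity.

```python
import numpy as np

def conv1d_same(x, kernel_size):
    # Simplified temporal convolution with 'same' padding (an averaging
    # kernel stands in for the paper's learned dynamic convolutions).
    T, _ = x.shape
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[t:t + kernel_size].mean(axis=0) for t in range(T)])

def multiscale_branch(x, scales=(3, 5, 7)):
    # Parallel branches at several temporal scales, concatenated on the
    # feature axis; scale choices here are illustrative assumptions.
    return np.concatenate([conv1d_same(x, k) for k in scales], axis=1)

def attention_pool(x):
    # Attention pooling over time: softmax scores weight each frame,
    # yielding one vector per sequence instead of truncating it.
    scores = x.mean(axis=1)                  # (T,) per-frame score
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ x                             # (D,) pooled feature

def bridge_fusion(a, v, stages=3, alpha=0.5):
    # Multistage interactive recursive fusion: at each stage, each
    # modality absorbs a fraction of the other before the next pass.
    for _ in range(stages):
        a, v = a + alpha * v, v + alpha * a
    return np.concatenate([a, v])

# Toy audio/visual sequences: 10 frames, 4-dim features each.
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 4))
visual = rng.normal(size=(10, 4))

a_feat = attention_pool(multiscale_branch(audio))   # (12,)
v_feat = attention_pool(multiscale_branch(visual))  # (12,)
fused = bridge_fusion(a_feat, v_feat)
print(fused.shape)  # (24,)
```

In a real model the fused vector would feed a regression or classification head predicting a depression score; here the sketch only shows how the three stages named in the abstract compose.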