Audio-Visual Feature Disentanglement and Fusion Network for Automatic Depression Severity Prediction

Shihao Li, Zhuhong Shao, Rongyin Qin, Yongzhen Huang, Peipeng Liang, Xiaobai Li, Yinan Jiang, Yanhe Deng, Tie Liu, Xiaohui Tan

Published: 2026, Last Modified: 02 Apr 2026, IEEE Trans. Affect. Comput. 2026, CC BY-SA 4.0

Abstract: In order to achieve early screening and assist clinical decision-making, automatic depression assessment based on multimodal data is highly anticipated. However, existing methods often suffer from a semantic gap and information redundancy caused by the heterogeneity among modalities. To address this challenge, this paper investigates a novel Feature Disentanglement and Fusion Network (FDFNet) for predicting depression severity from audio-visual cues. First, we design shared and private encoders to disentangle modality-shared and modality-private representations. The shared representations, which capture joint information, are subject to similarity constraints between modalities so that their distributions are as close as possible. The private representations, which capture the unique features of each modality, are subject to independence constraints that keep their distributions distinct. A decoder is then developed to reconstruct the unimodal representations, with constraints that minimize information loss. Finally, an efficient fusion strategy based on addition and concatenation is utilized to aggregate information. Experimental results on four benchmark datasets demonstrate that the proposed FDFNet consistently outperforms several state-of-the-art methods, with competitive MAE/RMSE values of 6.22/7.58 on AVEC2013, 5.21/6.49 on AVEC2014, 4.25/5.34 on DAIC-WOZ, and 4.41/5.10 on E-DAIC, indicating that multimodal deep learning on audio-visual data is an attractive solution for objectively evaluating depression severity.
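
The abstract does not give the exact form of the constraints or of the fusion step, so the following is only a minimal illustrative sketch in plain Python, not the FDFNet implementation. Every function name is hypothetical. It assumes a mean squared error for the similarity and reconstruction constraints, a squared cosine similarity between the two private vectors for the independence constraint, and a fusion that adds the shared vectors and concatenates the result with the private vectors.

```python
import math


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_loss(shared_audio, shared_video):
    # Assumption: MSE pulls the two modality-shared vectors together.
    return mse(shared_audio, shared_video)


def independence_loss(private_audio, private_video):
    # Assumption: squared cosine similarity, which is zero when the
    # two modality-private vectors are orthogonal.
    return cosine(private_audio, private_video) ** 2


def reconstruction_loss(original, reconstructed):
    # Assumption: MSE between a unimodal feature and its decoder output.
    return mse(original, reconstructed)


def fuse(shared_audio, shared_video, private_audio, private_video):
    # Assumption: element-wise addition of the shared vectors, then
    # concatenation with both private vectors.
    shared = [a + v for a, v in zip(shared_audio, shared_video)]
    return shared + private_audio + private_video


if __name__ == "__main__":
    s_a, s_v = [1.0, 2.0], [1.0, 4.0]   # toy shared features
    p_a, p_v = [1.0, 0.0], [0.0, 1.0]   # toy private features
    print(similarity_loss(s_a, s_v))
    print(independence_loss(p_a, p_v))
    print(fuse(s_a, s_v, p_a, p_v))
```

In a trained model these terms would be computed on encoder outputs and combined with the regression loss; the weighting between them is not stated in the abstract and is left out here.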