Integrating Deep Facial Priors Into Landmarks for Privacy Preserving Multimodal Depression Recognition

Yuchen Pan, Yuanyuan Shang, Zhuhong Shao, Tie Liu, Guodong Guo, Hui Ding

Published: 01 Jan 2024, Last Modified: 14 Nov 2024. IEEE Trans. Affect. Comput. 2024. License: CC BY-SA 4.0

Abstract: Automatic depression diagnosis is a challenging problem that requires integrating spatial-temporal information and extracting features from audio-visual signals. The trend toward recognition algorithms based on facial landmarks, adopted for privacy protection, has created additional challenges. In this article, we propose an audio-visual attention network (AVA-DepressNet) for depression recognition. It is a novel multimodal framework with facial privacy protection that uses attention-based modules to enhance audio-visual spatial and temporal features. An adversarial multistage (AMS) training strategy is developed to optimize the encoder-decoder structure, and prior knowledge of facial structure is incorporated into AMS training. AVA-DepressNet is evaluated on popular audio-visual depression datasets: AVEC 2013, AVEC 2014, and AVEC 2017. The results show that our approach achieves state-of-the-art or competitive performance for depression recognition.