Depression Scale Dictionary Decomposition Framework for Multimodal Automatic Depression Level Prediction

Mingyue Niu, Xu Wang, Jibing Gong, Bin Liu, Jianhua Tao, Björn W. Schuller

Published: 01 Jan 2025, Last Modified: 13 Mar 2026, Venue: IEEE Transactions on Circuits and Systems for Video Technology, License: CC BY-SA 4.0

Abstract: Many researchers currently aim to achieve automatic depression level prediction via speech and video behavior analysis. However, previous works have struggled to decompose audio and video sequences into information related to and unrelated to depression scores, which hinders the model’s perception of depression cues. In addition, previous works implement multimodal fusion using attention mechanisms or linear layers, but fail to simultaneously consider the Euclidean relationship among tokens and the non-Euclidean relationship among channels, which limits their ability to capture depression cues. To address these issues, we propose a depression scale dictionary decomposition framework, which mainly comprises a Bidirectional Dictionary Decomposition (BDD) module and a Bidirectional Multimodal Fusion (BMF) module. The BDD module uses dictionaries generated from the depression scale to semantically decompose audio and video sequences, along both the token and channel dimensions, into information related to and unrelated to depression scores, thereby promoting depression cue perception. Moreover, considering the respective characteristics of tokens and channels, the BMF module uses linear layers and graph convolution to achieve cross-modal mixing, which aggregates the audio and video sequences for predicting depression levels. Validation on the AVEC 2013, AVEC 2014 and DAIC-WOZ datasets demonstrates our method’s superiority.
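
The cross-modal mixing idea in the abstract (linear layers for the Euclidean token axis, graph convolution for the non-Euclidean channel axis) can be illustrated with a minimal pure-Python sketch. This is not the authors' BMF implementation: the function names, the stacking of audio and video along the token axis, the cosine-similarity channel graph, and the weight shapes are all assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def token_mix(x, w_tok):
    """Linear mixing along the token axis: x is (T, C), w_tok is (T, T)."""
    return matmul(w_tok, x)

def channel_graph_conv(x, w_feat):
    """One graph-convolution step with channels as nodes (hypothetical design).

    x is (T, C); each channel's feature vector is its column of length T.
    The adjacency is clipped cosine similarity between channels (an assumption),
    with self-loops and symmetric degree normalization. w_feat is (T, T).
    """
    nodes = transpose(x)  # (C, T): one row per channel
    norms = [math.sqrt(sum(v * v for v in n)) or 1.0 for n in nodes]
    c = len(nodes)
    adj = [[max(0.0, sum(p * q for p, q in zip(nodes[i], nodes[j])) / (norms[i] * norms[j]))
            for j in range(c)] for i in range(c)]
    for i in range(c):
        adj[i][i] = 1.0  # self-loops keep every degree positive
    deg = [sum(row) for row in adj]
    a_hat = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(c)] for i in range(c)]
    out_nodes = matmul(matmul(a_hat, nodes), w_feat)  # (C, T)
    return transpose(out_nodes)  # back to (T, C)

def fuse(audio, video, w_tok, w_feat):
    """Stack both modalities along tokens, then mix tokens and channels."""
    stacked = audio + video  # (2T, C); assumes equal channel counts
    return channel_graph_conv(token_mix(stacked, w_tok), w_feat)

if __name__ == "__main__":
    audio = [[1.0, 2.0], [3.0, 4.0]]
    video = [[0.5, 1.5], [2.5, 3.5]]
    eye = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    fused = fuse(audio, video, eye, eye)
    print(len(fused), len(fused[0]))
```

In this sketch the token axis is treated as an ordered grid that a dense linear map can mix directly, while the channel axis is treated as a graph whose edges are inferred from the data. In a trained model the weights would be learned, and a pooling step plus a regression head would map the fused sequence to a depression score.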

External IDs: doi:10.1109/tcsvt.2025.3533480