Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio and Textual Data

Puneet Kumar, Shreshtha Misra, Zhuhong Shao, Bin Zhu, Balasubramanian Raman, Xiaobai Li

Published: 2025, Last Modified: 07 Nov 2025, WACV 2025, CC BY-SA 4.0

Abstract: Motivated by depression's significant impact on global health, this work proposes MultiDepNet, a novel multi-modal interpretable depression detection system integrating visual, physiological, audio, and textual data. Through dedicated feature extraction methods (MTCNN for video, TS-CAN for physiological, ResNet-18 for audio, and RoBERTa for text modalities) and a strategic fusion of modality-specific networks including CNN-RNN, Transformer, MLP, and ResNet-18, it achieves significant advancements in depression detection. Its performance, evaluated across four benchmark datasets (AVEC 2013, AVEC 2014, DAIC, and E-DAIC), shows an average MAE of 5.64, RMSE of 7.15, accuracy of 74.19%, precision of 0.7373, recall of 0.7378, and F1 of 0.7376. It also implements a Multiviz-based interpretability mechanism that computes each modality's contribution to the model's performance. The results reveal the visual modality to be the most significant, contributing 37.88% towards depression detection.
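For readers unfamiliar with the reported metrics, the following is a minimal sketch of their conventional definitions: MAE and RMSE for predicted depression scores, and accuracy, precision, recall, and F1 for binary depressed/non-depressed labels. It is not the authors' evaluation code; the function names and the sample values are hypothetical and chosen only for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted scores."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def binary_metrics(labels, preds):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive class)."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    tn = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 0)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    accuracy = (tp + tn) / len(labels)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical scores and labels, not taken from the paper or its datasets.
true_scores = [10, 20, 5, 30]
pred_scores = [12, 17, 5, 25]
true_labels = [1, 0, 1, 1, 0, 0]
pred_labels = [1, 0, 0, 1, 1, 0]

print("MAE:", mae(true_scores, pred_scores))
print("RMSE:", rmse(true_scores, pred_scores))
print("accuracy, precision, recall, F1:", binary_metrics(true_labels, pred_labels))
```

Note that the figures in the abstract are averages across four datasets, so the averaged F1 need not equal the harmonic mean of the averaged precision and recall.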

External IDs: dblp:conf/wacv/0003MSZRL25