Abstract: Clinical interviews are the gold standard for detecting depression, and previous work using multimodal features from participants' audio, transcribed text, and video has shown promising results. Recent approaches further improve performance by incorporating an additional textual modality, the interviewer's prompts, during training. However, these approaches risk introducing biases, as models may over-rely on specific prompt-response pairs that are not always present in real-world settings. This can lead models to exploit such cues as shortcuts for detecting depression rather than learning the language and behavior that genuinely indicate the subject's mental health, ultimately undermining consistency and objectivity. To address this, we propose a novel approach that combines Contextual Position Encoding (CoPE) and Latent Space Regularization (LSR), leveraging both subjects' responses (audio) and the interviewer's prompts (text). CoPE captures the evolving context of the interview, ensuring that the model draws on the entire conversation rather than relying on isolated or late-stage cues, which helps it understand the interaction holistically and reflect mental health indicators more accurately. LSR introduces constraints that enforce consistency in the model's learned representations, reducing overfitting to superficial cues and guiding the model toward more generalizable patterns. By smoothing the latent space, LSR helps the model focus on meaningful, high-level representations of both audio and text. Our approach yields competitive results on the DAIC-WOZ benchmark and surpasses the state of the art on the EATD benchmark. The code is released.
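The abstract names two components, CoPE and LSR, without giving their exact formulations. The sketch below is illustrative only: it follows the general published CoPE idea (gated, context-dependent positions with interpolated position embeddings) and pairs it with one common form of latent consistency regularization. All class and function names, tensor shapes, and the specific consistency loss are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CoPEBias(nn.Module):
    """Sketch of Contextual Position Encoding (CoPE).

    Positions are not fixed token indices: for each query, a gate in [0, 1] is
    computed against every earlier key, and a key's position is the cumulative
    sum of gates between it and the query. Fractional positions are handled by
    interpolating between learned integer-position embeddings.
    """

    def __init__(self, head_dim: int, max_pos: int):
        super().__init__()
        self.max_pos = max_pos
        self.pos_emb = nn.Parameter(torch.zeros(max_pos, head_dim))

    def forward(self, q: torch.Tensor, attn_logits: torch.Tensor) -> torch.Tensor:
        # q: (B, T, D); attn_logits: (B, T, T), causally masked with -inf above the diagonal.
        gates = torch.sigmoid(attn_logits)               # soft decision: does key j "count" for query i?
        pos = gates.flip(-1).cumsum(dim=-1).flip(-1)     # contextual position of key j w.r.t. query i
        pos = pos.clamp(max=self.max_pos - 1)
        low, high = pos.floor().long(), pos.ceil().long()
        frac = pos - low                                 # fractional part for interpolation
        logits_int = q @ self.pos_emb.t()                # (B, T, max_pos): query vs. integer positions
        bias_low = logits_int.gather(-1, low)
        bias_high = logits_int.gather(-1, high)
        # Interpolated position bias, to be added to attn_logits before the softmax.
        return (1.0 - frac) * bias_low + frac * bias_high


def latent_consistency_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Illustrative latent space regularizer (one possible form of LSR): pull together
    the latent codes of two stochastic forward passes of the same interview segment
    (e.g., different dropout masks or small input noise), encouraging a smooth latent
    space that does not latch onto superficial prompt-response cues."""
    return torch.mean((z_a - z_b) ** 2)
```

In such a setup, the consistency term would be added to the task loss with a small weight, and the CoPE bias would replace or supplement fixed positional encodings in the attention layers that fuse the audio and prompt streams; the actual design choices are those of the paper, not this sketch.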
Keywords: depression detection, speech and language processing, contextual position encoding, latent space regularization
TL;DR: A multimodal approach to detecting depression using participants' audio and interviewers' text prompts.
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13181