A Study on The Impact of Foundation Models on Automatic Depression Detection from Speech Signals

Published: 2025 · Last Modified: 08 Jan 2026 · INTERSPEECH 2025 · CC BY-SA 4.0
Abstract: An automatic depression detection (ADD) system using spoken language offers the opportunity to develop practical, low-cost tools to detect symptoms early. However, limited data availability, privacy concerns, and transcription effort pose significant challenges. Recent advancements in foundation models, capable of understanding and processing multimodal inputs, present opportunities for enhancing ADD systems. This study explores various speech foundation models to investigate their impact on ADD. We leverage Whisper and MMS for automatic transcription and integrate speech and text embeddings into a language model optimized with low-rank adaptation (LoRA). In addition, we examine the effects of fine-tuning strategies and prompt formats on model performance. We use English and Bengali datasets to demonstrate the potential of our method for ADD, even with moderate-quality transcriptions. The best speech and language foundation models outperform baseline models on both datasets.
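The sketch below is a minimal, illustrative rendering (not the authors' code) of the pipeline the abstract describes: transcribe speech with a Whisper checkpoint, then adapt a pretrained language model with LoRA for depression classification. Model names, target modules, and hyperparameters are placeholder assumptions, and the fusion of speech and text embeddings described in the paper is simplified here to a text-only classifier.

```python
# Illustrative sketch under assumed model names and hyperparameters;
# the paper additionally fuses speech embeddings, which is omitted here.
import torch
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

# 1) Automatic transcription with a Whisper checkpoint (MMS could be swapped in).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
transcript = asr("interview_clip.wav")["text"]  # hypothetical audio file

# 2) Wrap a pretrained language model with low-rank adapters (LoRA) so that only
#    a small fraction of parameters is updated during fine-tuning.
backbone = "roberta-base"  # placeholder backbone, not the paper's choice
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = AutoModelForSequenceClassification.from_pretrained(backbone, num_labels=2)

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                      target_modules=["query", "value"], task_type="SEQ_CLS")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # confirms most backbone weights stay frozen

# 3) Score the transcript with the (to-be-fine-tuned) binary classification head.
inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)
```

In practice the LoRA-wrapped model would be fine-tuned on labeled interview transcripts before step 3; the snippet only shows how the components connect.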