Deep neural network-based analysis of voice biomarkers for monitoring treatment response in adolescent major depressive disorder

June-Woo Kim, Haram Yoon, Bung-Nyun Kim, Sang-Yeol Lee, Dae-Jin Kim, Seong-Eun Moon, Yera Choi, Chan-Mo Yang

Published: 04 Feb 2026, Last Modified: 26 May 2026. Communications Medicine. CC BY-SA 4.0

Abstract: In adolescents, identifying objective biomarkers for treatment response is crucial for the development of effective interventions. Voice-based biomarkers have recently shown potential to capture treatment-related changes in Major Depressive Disorder (MDD). While prior studies have been cross-sectional experiments with a single speech sample, this study addresses a critical gap by evaluating intra-patient changes in speech over the treatment period, providing insight into how these voice biomarkers evolve within individuals. We collected pre- and post-treatment voice samples from 48 adolescent MDD patients. We hypothesized that deep learning models could detect clinically meaningful changes in depressive states during treatment. Therefore, we compared machine learning and deep learning models for depressive state classification. Additionally, we introduced the Dual Voice-based Depressive State Analysis (DVDSA) method to categorize intra-patient depressive state changes as recovery, worsening, or unchanged, highlighting the deep learning models’ ability to detect these variations. Among the acoustic features, only the fundamental frequency exhibits significant changes between pre- and post-treatment states after Holm-Bonferroni correction. Machine learning models demonstrate limited performance in distinguishing treatment states, with the best F1-score reaching 65.83%. In contrast, deep learning models, particularly WavLM, achieve markedly higher performance in binary classification, with an F1-score of 78.05%. WavLM maintains robust performance when applied to the DVDSA method, achieving an F1-score of 70.58%. These findings suggest that machine learning models and individual acoustic features may not sufficiently capture treatment-related changes in MDD patients. This study underscores the value of deep learning models using the DVDSA method, addressing the limitations of pre- and post-treatment classification and highlighting their potential to advance personalized treatment strategies for adolescent MDD.

This study explored whether changes in teenagers’ voices could indicate depression improvement or worsening during therapy. We collected voice recordings from 48 adolescents diagnosed with depression, both before and after they received standard clinical treatment. The voice samples were recorded while participants performed a simple color-naming task known as the Stroop test, which measures attention. Using these recordings, we investigated whether two types of computational models, deep learning and machine learning, could distinguish changes after treatment. We found that the fundamental frequency, which reflects the pitch of the voice, was the only speech feature that clearly changed after treatment. The deep learning models were better at detecting these differences. Moreover, we classified a patient’s mental state from two speech samples as recovered, worsened, or unchanged. This suggests that voice analysis may help personalize future mental health treatment for adolescents with depression.

Kim, Yoon et al. analyze pre- and post-treatment voice recordings from 48 adolescents with major depressive disorder using machine learning and deep learning models. Deep learning outperformed machine learning approaches and accurately identified recovery, worsening, or unchanged states.
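For readers unfamiliar with the Holm-Bonferroni correction mentioned in the abstract, the following is a minimal sketch of the general step-down procedure. It is not the authors' analysis code: the feature names and p-values are hypothetical placeholders chosen for illustration, and the significance level of 0.05 is an assumption.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit the hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is tested against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            # Step-down rule: after the first failure, no larger p-value is rejected.
            break
    return reject


# Hypothetical p-values for pre- vs. post-treatment comparisons (NOT from the study).
p_values = {"f0": 0.004, "jitter": 0.03, "shimmer": 0.20, "hnr": 0.04}
names = list(p_values)
decisions = holm_bonferroni([p_values[n] for n in names])
significant = [n for n, rejected in zip(names, decisions) if rejected]
print(significant)
```

With these placeholder values, several features fall below 0.05 before correction, but only the smallest p-value clears its adjusted threshold. This mirrors the kind of outcome the abstract describes, in which a single acoustic feature remains significant after correction.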

External IDs: doi:10.1038/s43856-025-01326-3