Investigation of Explainable Multimodal Methods for Detecting Mental Disorders

Mikhail Dolgushin, Daria Guseva, Alexey Karpov

Published: 2025, Last Modified: 02 Mar 2026. Venue: SPECOM (1) 2025. License: CC BY-SA 4.0
Abstract: This paper introduces an interpretable, multimodal approach for detecting cognitive disorders, specifically depression and Parkinson’s disease, using non-medical video data from the WSM dataset. Addressing the critical need for explainability in automated health assessment, this study combines interpretable audio, visual, and textual features to bridge the gap between diagnostic accuracy and transparency. Our methodology uses acoustic features (eGeMAPS), linguistic and prosodic features (BlaBla), and visual cues (facial landmarks, pose, and personality/emotion traits from the OCEAN-AI framework) extracted from spontaneous speech and video recordings. Classical machine learning models, such as Logistic Regression, SVM, Decision Trees, and Random Forests, are employed for classification, with performance benchmarked against neural network-based models. Experiments demonstrate that interpretable feature ensembles achieve competitive results, reaching up to 77.8% unweighted average recall (UAR) for depression and 66.9% UAR for Parkinson’s disease on the test subsets. SHAP value analysis highlights the importance of specific facial landmarks and linguistic features in driving accurate predictions. These results underscore the potential of computationally efficient, clinically relevant, and transparent multimodal methods for practical and accessible mental health screening, particularly in noisy, real-world settings. Future research will focus on refining feature selection, improving data cleaning, and exploring explainable attention mechanisms within deep learning models to further improve both accuracy and interpretability on medically annotated datasets.
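The results above are reported as UAR (unweighted average recall), the mean of per-class recalls, which gives each class equal weight regardless of how many samples it has. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name and the toy labels are assumptions made up for illustration and are not data from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class counts equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall is computed per class, then averaged without class weighting.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (1 = disorder, 0 = control).
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

On this toy example the minority class has a recall of 1/2 and the majority class 5/6, so UAR is lower than plain accuracy. This illustrates why UAR is often preferred when classes are imbalanced.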