Keywords: Multimodal Fusion, Depression Prediction, Regression, Low-Resource Clinical Data, Intra-Inter Modality
TL;DR: This study presents a multimodal fusion architecture that integrates text, audio, and visual signals from E-DAIC to enhance the accuracy of predicting depression in resource-limited clinical settings.
Abstract: Predicting depression from clinical interviews is difficult because of small sample sizes, class imbalance, and missing modalities. The English E-DAIC corpus, which provides multimodal recordings paired with PHQ-8 scores, exemplifies these limitations. We propose a hierarchical fusion approach for regression that first enriches each modality by combining handcrafted descriptors (e.g., eGeMAPS, OpenFace cues) with deep embeddings (e.g., BERT, VGGish), and then integrates the modalities through attention-based inter-modal fusion.
Our work makes three contributions. First, we present the first systematic application of intra- and inter-modal fusion to regression on English clinical interviews, building on previous research that focused on classification or non-English datasets. Second, we carry out a dedicated robustness assessment under missing modalities, reinterpreting bimodal and trimodal results to quantify modality importance and resilience when data streams are unavailable. Third, we show that hierarchical fusion improves generalization on small, imbalanced clinical datasets, consistently outperforming strong baselines across MAE, RMSE, R², and CCC. Together, these results support structured multimodal regression as a reliable approach for low-resource clinical settings and provide a foundation for interpretable, robust mental health AI.
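For readers less familiar with the four reported metrics, the following is a minimal sketch of how MAE, RMSE, R², and the concordance correlation coefficient (CCC) are conventionally computed for a regression target such as PHQ-8. It is not the authors' evaluation code: the function name `regression_metrics` and the example scores are hypothetical, and the sketch assumes population (1/n) variances in the CCC.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2 and CCC for two equal-length score lists.

    Hypothetical helper for illustration; uses population (1/n) moments.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    # R^2 compares the model's squared error with that of predicting the mean.
    r2 = 1.0 - mse / var_t if var_t > 0 else float("nan")

    # CCC penalises both weak correlation and systematic shift or scale error.
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    ccc = 2.0 * cov / denom if denom > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "r2": r2, "ccc": ccc}


# Made-up scores: predictions that are perfectly correlated but shifted by one.
example = regression_metrics([0, 1, 2, 3], [1, 2, 3, 4])
```

In this made-up example the Pearson correlation is 1, yet the CCC falls below 1 because of the constant offset. That sensitivity to systematic bias is why CCC is often reported alongside error-based metrics for severity scores.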
Submission Number: 9