Abstract: The detection of depression through non-verbal cues has gained significant attention. Previous research has predominantly centered on identifying depression in controlled laboratory environments, often under the supervision of psychologists or counselors. Unfortunately, datasets generated in such controlled settings may fail to capture individuals' behaviors in real-life situations. In response to this limitation, we present the Extended D-vlog dataset, comprising $1,261$ YouTube vlogs. We extracted features across the auditory, textual, and visual modalities from this dataset. To capture the interrelationship between these features and derive a multimodal representation spanning audio, video, and text, we harnessed the TVLT model. The TVLT model, combined with video, text, and audio (leveraging wav2vec2 features and spectrograms), produced the most promising results, achieving an \textbf{F1-score of $67.8\%$}.
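The abstract mentions audio features derived from wav2vec2 and spectrograms. Below is a minimal, hypothetical sketch of how such features could be extracted for one vlog clip using HuggingFace `transformers` and `torchaudio`; the file name, checkpoint (`facebook/wav2vec2-base-960h`), sampling rate, and spectrogram parameters are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical audio feature extraction: wav2vec2 embeddings + log-mel spectrogram.
# Checkpoint, sample rate, and mel settings are assumptions for illustration only.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform, sr = torchaudio.load("vlog_audio.wav")                 # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16_000)  # wav2vec2 expects 16 kHz
waveform = waveform.mean(dim=0, keepdim=True)                    # downmix to mono

# Frame-level wav2vec2 hidden states
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    wav2vec2_feats = model(**inputs).last_hidden_state           # shape: (1, T, 768)

# Log-mel spectrogram of the same clip (e.g., as TVLT-style audio input)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=128)(waveform)
log_mel = torch.log(mel + 1e-6)                                  # shape: (1, 128, frames)
```

Either representation can then be passed, alongside video frames and transcribed text, to a multimodal encoder such as TVLT for fusion.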
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.