MDDR: Multi-modal Dual-Attention Aggregation for Depression Recognition

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Oral · CC BY 4.0
Abstract: Automated diagnosis of depression is crucial for early detection and timely intervention. Previous research has largely concentrated on visual indicators, often neglecting the value of other data types. Although some studies have employed multiple modalities, they typically fall short of investigating the complex dynamics between features from different modalities over time. To address this challenge, we present MDDR, a Multi-modal Dual-Attention Aggregation Architecture for Depression Recognition. The framework builds on multi-modal pre-trained features and introduces two attention aggregation mechanisms: the Feature Alignment and Aggregation (FAA) module and the Sequence Encoding and Aggregation (SEA) module. The FAA module dynamically evaluates the relevance of each modality's features for every instance, enabling adaptive integration of these features over time. The SEA module then estimates the importance of the fused features for each frame and aggregates them according to that importance, extracting the features most relevant for accurately diagnosing depression. Moreover, we propose a loss function specifically designed for depression assessment, named DRLoss. Evaluated on the AVEC2013 and AVEC2014 depression audiovisual datasets, our approach achieves state-of-the-art performance.
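The abstract describes a two-stage attention pipeline: per-modality fusion at each time step (FAA), followed by importance-weighted pooling over frames (SEA). The following is a minimal PyTorch sketch of that structure under stated assumptions; the module internals, encoder choice, shapes, and hyperparameters are illustrative guesses, not the authors' implementation, and DRLoss is omitted because the abstract gives no detail about it.

```python
import torch
import torch.nn as nn

class FAA(nn.Module):
    """Sketch of Feature Alignment and Aggregation (assumed internals):
    scores each modality's features at every frame and fuses modalities
    with softmax-normalized weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-modality relevance score

    def forward(self, feats):                        # feats: (B, T, M, D)
        w = torch.softmax(self.score(feats), dim=2)  # weights over modalities
        return (w * feats).sum(dim=2)                # fused sequence: (B, T, D)

class SEA(nn.Module):
    """Sketch of Sequence Encoding and Aggregation (assumed internals):
    encodes the fused sequence, scores each frame's importance, and pools
    frames by those weights into one clip-level feature."""
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                  batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, x):                            # x: (B, T, D)
        h = self.encoder(x)
        w = torch.softmax(self.score(h), dim=1)      # weights over frames
        return (w * h).sum(dim=1)                    # clip feature: (B, D)

# Usage: fuse audio/visual features per frame, then pool over time;
# the resulting clip feature would feed a depression-score regressor.
feats = torch.randn(2, 16, 2, 128)   # (batch, frames, modalities, dim)
clip = SEA(128)(FAA(128)(feats))     # (2, 128)
```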
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We propose a novel Dual-Attention Aggregation module that adaptively fuses multi-modal information and captures the interrelations of fused features across time. This method enables a more nuanced understanding of complex data and captures the hidden cues within it, offering significant improvements on tasks that require integrating multiple modalities, such as recognizing depression from facial videos. It addresses previous limitations related to the static and isolated processing of multi-modal data and opens new avenues for research and application in multimedia analysis.
Submission Number: 4290