Abstract: Multi-modal emotion recognition has attracted increasing attention in human-computer interaction, as it extracts complementary information from physiological and behavioral features. Compared to single-modal approaches, multi-modal fusion methods are more susceptible to uncertainty in emotion recognition, such as feature heterogeneity and inconsistent predictions across modalities. Previous multi-modal approaches neither model uncertainty in the fusion process systematically nor reveal the dynamic variation of emotion over time. In this article, we propose a dynamic confidence-aware fusion network for robust recognition from heterogeneous emotion features, namely electroencephalogram (EEG) signals and facial expressions. First, we develop a self-attention-based multi-channel LSTM network to preliminarily align the heterogeneous emotion features. Second, we propose a confidence regression network to estimate the true class probability (TCP) of each modality, which helps explore uncertainty at the modality level. Then, the different modalities are weighted and fused according to the above two types of uncertainty. Finally, we adopt a self-paced learning (SPL) mechanism to further improve model robustness by alleviating the negative effect of hard samples. Experimental results on several multi-modal emotion datasets demonstrate that the proposed method outperforms state-of-the-art methods in recognition performance and explicitly reveals the dynamic variation of emotion through uncertainty estimation.
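For illustration only, the following is a minimal PyTorch-style sketch of the three ingredients named in the abstract: a per-modality confidence head regressing the TCP of its own prediction, confidence-weighted fusion of modality predictions, and hard (binary) self-paced sample weighting. The names (`ModalityBranch`, `tcp_target`, `fuse`, `spl_weights`), layer sizes, and the binary SPL scheme are assumptions for this sketch, not the authors' implementation.

```python
# Hypothetical sketch of TCP-based confidence-weighted fusion with SPL weighting.
# Not the paper's code; names, dimensions, and the binary SPL rule are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityBranch(nn.Module):
    """Classifier plus a confidence head that regresses the true class
    probability (TCP) of this modality's own prediction."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.conf_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
        )

    def forward(self, feat):
        logits = self.classifier(feat)            # (B, C)
        conf = self.conf_head(feat).squeeze(-1)   # (B,) predicted TCP in [0, 1]
        return logits, conf

def tcp_target(logits, labels):
    """Regression target for the confidence head: the softmax probability
    assigned to the ground-truth class."""
    probs = F.softmax(logits, dim=-1)
    return probs.gather(1, labels.unsqueeze(1)).squeeze(1)

def fuse(logits_list, conf_list):
    """Fuse modality predictions, weighting each by its normalized confidence."""
    conf = torch.stack(conf_list, dim=1)                          # (B, M)
    weights = conf / conf.sum(dim=1, keepdim=True).clamp_min(1e-8)
    stacked = torch.stack(logits_list, dim=1)                     # (B, M, C)
    return (weights.unsqueeze(-1) * stacked).sum(dim=1)           # (B, C)

def spl_weights(sample_losses, lam):
    """Hard self-paced weights: keep only samples whose loss is below the
    age parameter lam, which is gradually increased during training."""
    return (sample_losses < lam).float()

if __name__ == "__main__":
    torch.manual_seed(0)
    eeg_feat, face_feat = torch.randn(8, 128), torch.randn(8, 64)
    labels = torch.randint(0, 4, (8,))
    eeg_branch, face_branch = ModalityBranch(128, 4), ModalityBranch(64, 4)
    eeg_logits, eeg_conf = eeg_branch(eeg_feat)
    face_logits, face_conf = face_branch(face_feat)
    fused = fuse([eeg_logits, face_logits], [eeg_conf, face_conf])
    per_sample_loss = F.cross_entropy(fused, labels, reduction="none")
    v = spl_weights(per_sample_loss.detach(), lam=2.0)            # SPL mask
    loss = (v * per_sample_loss).mean() \
        + F.mse_loss(eeg_conf, tcp_target(eeg_logits.detach(), labels)) \
        + F.mse_loss(face_conf, tcp_target(face_logits.detach(), labels))
    print(fused.shape, loss.item())
```

In such a setup, the confidence heads would be fit with an MSE loss against the detached TCP targets, while the SPL mask down-weights hard samples early in training; how the paper combines these terms and schedules the age parameter is specified in the full text, not here.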