A Unimodal Valence-Arousal Driven Contrastive Learning Framework for Multimodal Multi-Label Emotion Recognition

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Oral, CC BY 4.0
Abstract: Multimodal Multi-Label Emotion Recognition (MMER) aims to identify one or more emotion categories expressed in a speaker's utterance. Despite promising results, previous studies on MMER represent each emotion category with a one-hot vector and ignore the intrinsic relations between emotions. Moreover, existing works mainly learn unimodal representations from the multimodal supervision signal of a single sample, failing to explicitly capture the unique emotional state of each modality and its emotional correlations across samples. To overcome these issues, we propose a $\textbf{Uni}$modal $\textbf{V}$alence-$\textbf{A}$rousal driven contrastive learning framework (UniVA) for the MMER task. Specifically, we adopt the valence-arousal (VA) space to represent each emotion category and treat the emotion correlations in the VA space as priors for learning the emotion category representations. Moreover, we employ pre-trained unimodal VA models to obtain VA scores for each modality of the training samples, and then leverage these VA scores to construct positive and negative samples, followed by supervised contrastive learning to learn VA-aware unimodal representations for multi-label emotion prediction. Experimental results on two benchmark datasets, MOSEI and M$^3$ED, show that the proposed UniVA framework consistently outperforms a number of existing methods for the MMER task.
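As a rough illustration of the VA-driven contrastive objective sketched in the abstract, the snippet below is a minimal, hypothetical example, not the authors' implementation: it assumes a batch of unimodal features together with per-sample VA scores from a pre-trained VA model, treats VA-nearby samples as positives, and applies a standard supervised-contrastive loss. The function name, the Euclidean-distance criterion, and the `va_threshold` parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def va_driven_supcon_loss(features, va_scores, va_threshold=0.2, temperature=0.07):
    """Minimal sketch of a VA-driven supervised contrastive loss (assumed form).

    features:  (N, D) unimodal representations for one modality.
    va_scores: (N, 2) valence-arousal scores from a pre-trained unimodal VA model.
    Samples whose VA scores lie within `va_threshold` (Euclidean distance)
    of the anchor are treated as positives; the rest of the batch are negatives.
    """
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature          # (N, N) similarity logits

    # Positive mask from pairwise VA distance (hypothetical selection rule).
    va_dist = torch.cdist(va_scores, va_scores)        # (N, N)
    pos_mask = (va_dist < va_threshold).float()
    pos_mask.fill_diagonal_(0)                         # exclude self-pairs

    # Standard SupCon-style log-softmax over all other samples in the batch.
    logits_mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    exp_sim = torch.exp(sim) * logits_mask
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)

    # Average log-likelihood over positives, skipping anchors with no positives.
    pos_count = pos_mask.sum(dim=1)
    valid = pos_count > 0
    loss = -(pos_mask * log_prob).sum(dim=1)[valid] / pos_count[valid]
    return loss.mean()
```

In practice such a loss would be computed per modality (text, audio, visual) and combined with the multi-label classification objective; the thresholding shown here is only one plausible way to turn continuous VA scores into positive/negative pairs.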
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Our work contributes to the multimodal domain by presenting a novel framework for emotion recognition. This framework goes beyond traditional methods that represent each emotion with a one-hot vector and thereby ignore the intrinsic relationships between emotions. Instead, we use the valence-arousal (VA) space to represent each emotion category, leveraging VA scores to capture the similarities and differences between emotions. Moreover, our framework addresses key limitations of existing multimodal multi-label emotion recognition (MMER) research by capturing the unique emotional state of each modality and its correlations across samples. We achieve this by employing pre-trained unimodal VA models and designing a VA-driven contrastive learning algorithm, which significantly improves multi-label emotion prediction. Experimental results on two benchmark datasets demonstrate the effectiveness of our approach.
Supplementary Material: zip
Submission Number: 5210