Abstract: Psychological emotion is typically either categorized into discrete emotion states, such as anger, happiness, and neutrality, or estimated as degrees within a two-dimensional continuous valence-arousal (VA) space. Previous studies on multimodal emotion recognition have employed fusion mechanisms across modalities but treated discrete emotion labels and VA degrees as separate recognition tasks. By modeling the relationship between these two types of labels, it becomes possible to leverage training datasets with different label types to improve multimodal emotion recognition. In this study, we explore the use of multiple label types by employing a 2D Kernel Density Estimation (2D-KDE) method to mathematically model their relations. We then propose a label fusion layer (LFL) based on these relations to adjust the predicted probabilities of emotion states produced by existing multimodal emotion recognition baselines. Through extensive experiments, we demonstrate that our proposed model improves emotion recognition performance and achieves superior results on the IEMOCAP and OMG-Emotion datasets.
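To make the label-relation modeling concrete, below is a minimal Python sketch of the 2D-KDE idea, not the paper's exact method: the synthetic VA annotations, the three-label set, and the normalization of densities into fusion weights are all illustrative assumptions. It fits one 2D KDE per discrete emotion label with scipy.stats.gaussian_kde and evaluates the resulting densities at a predicted VA point, the kind of signal a label fusion layer could use to adjust a classifier's predicted probabilities.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical training annotations: each sample carries a discrete
# emotion label and a continuous (valence, arousal) pair in [-1, 1]^2.
rng = np.random.default_rng(0)
va_points = {
    "anger":      rng.normal([-0.6, 0.6], 0.15, size=(200, 2)),
    "happiness":  rng.normal([0.7, 0.5], 0.15, size=(200, 2)),
    "neutrality": rng.normal([0.0, 0.0], 0.15, size=(200, 2)),
}

# Fit one 2D KDE per discrete label, i.e. p(valence, arousal | label).
# gaussian_kde expects data of shape (n_dims, n_samples).
kdes = {label: gaussian_kde(pts.T) for label, pts in va_points.items()}

# Evaluate each label's density at a predicted VA point; normalizing
# over labels gives soft compatibility weights that a label fusion
# layer could combine with the network's emotion-state probabilities.
va_pred = np.array([[0.65], [0.45]])  # shape (2, 1): one query point
densities = np.array([kdes[label](va_pred)[0] for label in kdes])
weights = densities / densities.sum()
for label, w in zip(kdes, weights):
    print(f"{label}: {w:.3f}")
```

Under these assumptions, a VA prediction near high valence and moderate arousal assigns most of its weight to "happiness", illustrating how continuous VA estimates can inform the discrete-label head.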