Robust Representation Learning for Multimodal Emotion Recognition with Contrastive Learning and Mixup

Published: 01 Jan 2024, Last Modified: 15 May 2025 · MRAC@MM 2024 · CC BY-SA 4.0
Abstract: Multimodal emotion recognition (MER) plays a crucial role in user sentiment analysis and in enhancing human-computer interaction. In real-world scenarios, however, noise interference such as environmental noise and image blurring is widespread and degrades a model's ability to recognize emotions. Improving the robustness of MER models against complex and variable noise is therefore a significant challenge. To address it, we propose a robust representation learning method based on contrastive learning and Mixup data augmentation that stabilizes model performance on noisy data. Specifically, we first apply Mixup in the representation space of each modality to broaden class decision boundaries and enhance generalization. Meanwhile, through a contrastive learning strategy, we align the representations of clean and noisy data to improve robustness. We evaluate our model on the test set of the MER2024 MER-NOISE track, where it achieves a weighted average F-score of 82.71%, 3.09% higher than the MER2024 baseline. This result demonstrates that our method effectively enhances model robustness and generalization.
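The two ingredients of the method (Mixup applied to per-modality embeddings, and a contrastive loss pulling clean and noisy views of the same sample together) can be sketched as follows. This is a minimal NumPy illustration under assumptions, not the authors' implementation: the Beta(α, α) mixing coefficient, the InfoNCE-style form of the contrastive loss, and the temperature `tau` are standard choices assumed here, as the abstract does not specify them.

```python
import numpy as np

def mixup_embeddings(x, y, alpha=0.2, rng=None):
    """Representation-space Mixup (hypothetical sketch): convexly combine
    embeddings x and soft labels y of randomly paired samples."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)          # mixing coefficient ~ Beta(alpha, alpha)
    idx = rng.permutation(len(x))         # random pairing of samples
    x_mix = lam * x + (1 - lam) * x[idx]
    y_mix = lam * y + (1 - lam) * y[idx]
    return x_mix, y_mix

def clean_noisy_contrastive_loss(z_clean, z_noisy, tau=0.1):
    """InfoNCE-style loss (assumed form): for each sample, the noisy view of
    the SAME sample is the positive; other samples in the batch are negatives."""
    def l2norm(z):
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    zc, zn = l2norm(z_clean), l2norm(z_noisy)
    logits = zc @ zn.T / tau                          # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))               # positives on the diagonal
```

When the noisy embeddings match the clean ones, the diagonal dominates the similarity matrix and the loss is near zero; as noise pushes the views apart, the loss grows, which is what drives the alignment described in the abstract.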