Multimodal Blockwise Transformer for Robust Sentiment Recognition

Published: 2024, Last Modified: 10 Nov 2025 | MRAC@MM 2024 | CC BY-SA 4.0
Abstract: The MER-NOISE track challenges participants to classify emotions from multimodal data, specifically audio and visual, with added noise. In this paper, we present a solution for the NOISE track of the MER2024 competition, which focuses on the robustness of emotion recognition in noisy environments. We propose a novel Multimodal Blockwise Transformer (MBT) architecture, which integrates visual, auditory, and textual features to improve emotion classification accuracy. Our approach includes several key innovations: the MBT network structure, the TIE module for weighted encoder input, and momentum contrast. Additionally, we employ diverse data augmentation methods, both conventional and novel, and introduce a confidence-based decision-level fusion strategy to enhance model performance. In the MER2024 NOISE track, our solution achieved a Weighted Average F-score (WAF) of 0.8365, securing third place. This result demonstrates the effectiveness and robustness of our approach in handling noisy data for emotion recognition tasks.
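For readers unfamiliar with the reported metric, the following is a minimal sketch of a Weighted Average F-score, assuming the standard definition: the per-class F1 scores averaged with weights proportional to each class's share of the ground-truth labels. The abstract does not specify the competition's evaluation script, so the function name `weighted_average_f1` and the emotion labels in the usage note are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def weighted_average_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (assumed WAF definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)  # number of ground-truth samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of true labels.
        score += (n / total) * f1
    return score
```

As a usage example with hypothetical labels, `weighted_average_f1(["happy", "happy", "sad", "angry"], ["happy", "sad", "sad", "angry"])` gives per-class F1 scores of 2/3, 2/3, and 1, weighted by 0.5, 0.25, and 0.25 respectively. Weighting by class frequency means that frequent emotion categories dominate the score, which is why this metric is often preferred over plain accuracy on imbalanced emotion datasets.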