Multimodal Fusion Strategies for Emotion Recognition

Published: 01 Jan 2024 · Last Modified: 20 May 2025 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Emotions play a crucial role in our daily lives, shaping our behavior and how we face challenges, even when we are not consciously aware of them. Detecting emotional states in others requires understanding the combined effect of the many cues that emotions produce, such as facial expressions, posture, tone of voice, and speech. To address this challenge, we propose a multimodal transformer-based model that recognizes an individual's emotional state from video together with the accompanying audio and text transcriptions. The model extracts the most relevant information from each modality to make a final prediction. Throughout this work, we integrate different fusion architectures with transformer-based models to determine the optimal combination, and we examine the performance of individual modalities and their combinations on the CMU-MOSEI dataset, which provides preprocessed video, audio, and text features. Our best model achieves a weighted accuracy of 85.59% on this dataset, surpassing previous work on this task.
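To illustrate the kind of fusion strategy the abstract describes, the following is a minimal sketch of one common variant: each modality's feature sequence is projected into a shared embedding space, the sequences are concatenated, and a joint transformer encoder produces a fused representation for classification. The class name, feature dimensions, layer counts, and number of classes below are illustrative assumptions, not the specific architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class MultimodalFusionClassifier(nn.Module):
    """Illustrative fusion sketch: per-modality projections, a shared
    transformer encoder over the concatenated token sequence, and a
    classification head over a learned summary token."""

    def __init__(self, dim_video=35, dim_audio=74, dim_text=300,
                 d_model=128, num_layers=4, num_heads=8, num_classes=6):
        super().__init__()
        # Project each modality's per-frame features to a shared width.
        self.proj_video = nn.Linear(dim_video, d_model)
        self.proj_audio = nn.Linear(dim_audio, d_model)
        self.proj_text = nn.Linear(dim_text, d_model)
        # Learned summary token prepended to the fused sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, video, audio, text):
        # video/audio/text: (batch, seq_len_m, dim_m) feature sequences.
        tokens = torch.cat([
            self.proj_video(video),
            self.proj_audio(audio),
            self.proj_text(text),
        ], dim=1)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([cls, tokens], dim=1))
        # Classify from the summary token's final representation.
        return self.head(fused[:, 0])

# Toy usage with random tensors (shapes are assumptions for illustration).
model = MultimodalFusionClassifier()
logits = model(torch.randn(2, 50, 35), torch.randn(2, 50, 74), torch.randn(2, 50, 300))
print(logits.shape)  # torch.Size([2, 6])
```

This sketch corresponds to an early/joint fusion design; the late-fusion alternatives mentioned in the paper would instead encode each modality separately and combine the per-modality predictions or embeddings at the end.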