Exploring Fusion Techniques for Multimodal Emotion Recognition

Published: 01 Jan 2024, Last Modified: 10 Apr 2025 · COMM 2024 · CC BY-SA 4.0
Abstract: Emotion recognition is crucial for improving human-computer interaction, allowing machines to understand and respond to emotions. This enhances user experience and personalizes interactions, benefiting applications such as virtual assistants and mental health diagnostics. In this paper, we propose a multimodal emotion recognition system that relies on speech and visual information. For the speech-based modality, we conducted experiments using a CNN architecture along with five types of handcrafted speech features. For visual emotion recognition, we evaluated a ResNet18-based network, fine-tuning a model pretrained on the larger ImageNet dataset. After developing these two separate classifier architectures, one for speech and one for visual information, we studied multiple late fusion techniques to combine the two sources of information effectively. Our best multimodal system achieved an accuracy of 97.95%, demonstrating an improvement over the individual modalities and other benchmark systems.
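The abstract describes two unimodal classifiers whose outputs are combined by late fusion. The sketch below is a minimal illustration of that setup, not the paper's exact architecture: the speech CNN layout, the number of emotion classes, and the weighted-average fusion rule with mixing weight alpha are all assumptions for demonstration purposes; only the use of handcrafted speech features, an ImageNet-pretrained ResNet18, and score-level (late) fusion are taken from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

NUM_EMOTIONS = 7  # hypothetical number of emotion classes; not stated in the abstract


class SpeechCNN(nn.Module):
    """Illustrative 1D CNN over handcrafted speech features (e.g., MFCC frames).
    The paper's exact layer configuration and feature set are not given here."""

    def __init__(self, n_features=40, n_classes=NUM_EMOTIONS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size embedding
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):            # x: (batch, n_features, time)
        h = self.conv(x).squeeze(-1)
        return self.fc(h)            # class logits


def build_visual_model(n_classes=NUM_EMOTIONS):
    """ResNet18 pretrained on ImageNet with the classification head replaced,
    ready for fine-tuning on face images."""
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, n_classes)
    return model


def late_fusion(speech_logits, visual_logits, alpha=0.5):
    """Weighted average of per-modality softmax scores, one common late-fusion rule;
    alpha is a hypothetical mixing weight, e.g., tuned on a validation set."""
    p_speech = F.softmax(speech_logits, dim=1)
    p_visual = F.softmax(visual_logits, dim=1)
    return alpha * p_speech + (1.0 - alpha) * p_visual


# Usage with dummy inputs
speech_net, visual_net = SpeechCNN(), build_visual_model()
audio_feats = torch.randn(8, 40, 100)       # batch of handcrafted feature sequences
face_images = torch.randn(8, 3, 224, 224)   # batch of face crops
fused = late_fusion(speech_net(audio_feats), visual_net(face_images))
predictions = fused.argmax(dim=1)           # fused emotion predictions
```

Because fusion happens at the score level, each classifier can be trained independently on its own modality, and the fusion rule (averaging, weighting, or a small meta-classifier over the concatenated scores) can be chosen afterwards.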