Abstract: Human–machine interactions are becoming increasingly common in society, making it important to improve their user experience. In this regard, an accurate emotion recognition system could substantially benefit that experience. This work presents a novel framework for multimodal emotion recognition that performs fusion at both the feature and score levels to effectively combine visual, audio, and textual information. Modality-specific embeddings are extracted using VGGFace for visual data, a Wav2Vec2-Large-Robust model for audio, and BERT for text. These representations are unified via three different feature-level fusion strategies: concatenation, Embrace, and cross-attention. A subsequent score-level fusion employs an adaptive weighted sum to produce the final class probabilities. On the four-emotion classification task of the IEMOCAP dataset, our approach achieves an unweighted accuracy of 73.53%, competitive with state-of-the-art baselines, and demonstrates the added value of visual cues. Our experiments also analyze the impact of fusion and pooling choices, providing insights for future multimodal systems.
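The score-level stage described above combines the class probabilities produced by the different fusion branches through an adaptive weighted sum. The sketch below illustrates one plausible formulation, assuming the branch weights are learnable and softmax-normalized so that the fused scores remain a valid probability distribution; the function name and the exact normalization are illustrative assumptions, not the paper's verbatim implementation:

```python
import numpy as np

def score_level_fusion(prob_list, weights):
    """Adaptive weighted sum over per-branch class probabilities.

    prob_list: list of (num_classes,) probability vectors, one per
               fusion branch (e.g. concatenation, Embrace, cross-attention).
    weights:   raw (unnormalized) branch weights; softmax-normalized here
               so the fused output is still a probability distribution.
    (Hypothetical sketch -- the paper's exact weighting scheme may differ.)
    """
    w = np.exp(weights - np.max(weights))  # numerically stable softmax
    w = w / w.sum()
    fused = np.zeros_like(prob_list[0], dtype=float)
    for wi, p in zip(w, prob_list):
        fused += wi * np.asarray(p, dtype=float)
    return fused

# Toy example: three branch outputs over the four IEMOCAP emotion classes.
p_concat    = np.array([0.6, 0.2, 0.1, 0.1])
p_embrace   = np.array([0.5, 0.3, 0.1, 0.1])
p_crossattn = np.array([0.7, 0.1, 0.1, 0.1])

fused = score_level_fusion(
    [p_concat, p_embrace, p_crossattn],
    weights=np.array([1.0, 0.5, 1.5]),  # illustrative learned weights
)
pred = int(np.argmax(fused))  # index of the predicted emotion class
```

Because the branch weights are parameters rather than fixed constants, the model can learn to lean on whichever feature-fusion strategy is most reliable for the data at hand.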
External IDs: doi:10.1007/978-3-032-10192-1_45