MERSA: Multimodal Emotion Recognition with Self-Align Embedding

Published: 01 Jan 2024 · Last Modified: 15 May 2025 · ICOIN 2024 · License: CC BY-SA 4.0
Abstract: Emotions are an integral part of human communication and interaction, significantly shaping our social connections, decision-making, and overall well-being. Understanding and analyzing emotions has become essential in various fields, including psychology, human-computer interaction, marketing, and healthcare. Previous approaches have made significant strides in improving the accuracy of emotion prediction from speech. However, current models still fall short in real-life applications, owing to factors such as lack of context, ambiguity in speech and meaning, and other contributing elements. To reduce the ambiguity of emotions within speech, this paper leverages multiple data modalities, specifically textual and acoustic information. To analyze these modalities, we propose a novel approach called MERSA, which uses a self-align method to extract contextual features from both textual and acoustic information. By leveraging this technique, MERSA can effectively create fusion feature vectors from the multiple inputs, enabling a more accurate and holistic analysis of emotions within speech. Moreover, MERSA incorporates a cross-attention module into its network architecture, which enables it to capture and leverage the interdependencies between the textual and acoustic modalities.
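To make the cross-attention fusion idea concrete, below is a minimal PyTorch sketch of how text and acoustic embeddings could attend to one another and be combined into a single fusion vector for emotion classification. This is an illustrative assumption, not the authors' released implementation: the embedding dimension, mean pooling, module names (CrossModalFusion), and the 4-way emotion head are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative cross-attention fusion: text queries attend over acoustic
    frames and vice versa; the two attended views are pooled and concatenated
    into one fusion feature vector (details assumed, not taken from the paper)."""

    def __init__(self, dim: int = 256, heads: int = 4, num_emotions: int = 4):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, T_text, dim); audio_emb: (batch, T_audio, dim)
        t2a, _ = self.text_to_audio(text_emb, audio_emb, audio_emb)  # text attends to audio
        a2t, _ = self.audio_to_text(audio_emb, text_emb, text_emb)  # audio attends to text
        # Mean-pool each attended sequence and concatenate into a fusion vector.
        fused = torch.cat([t2a.mean(dim=1), a2t.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Usage with random placeholder features standing in for text/acoustic encoders.
model = CrossModalFusion()
logits = model(torch.randn(2, 20, 256), torch.randn(2, 120, 256))
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the two attention directions let each modality reweight the other's frames, which is one common way a cross-attention module can capture the text-acoustic interdependencies the abstract refers to.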