Towards robust multimodal emotion recognition in conversation with multi-modal transformer and variational distillation fusion

Published: 2025, Last Modified: 21 Jan 2026, J. Intell. Inf. Syst. 2025, CC BY-SA 4.0
Abstract: Multimodal Emotion Recognition in Conversation (MERC) leverages multimodal information such as language, visual, and audio signals to enhance the understanding of human emotions. Current multimodal interaction frameworks inadequately resolve inherent information conflicts and redundancy because they assume equivalent quality across heterogeneous modalities; improper assessment of modality importance further aggravates this problem. To address these issues, we introduce a \(\textbf{L}\)anguage-\(\textbf{F}\)ocused Augmented Transformer with \(\textbf{V}\)ariational \(\textbf{D}\)istillation Fusion network, called \(\textbf{LFVD}\). In contrast to previous work, LFVD centers on the language modality: the Language-Focused Augmented Transformer extracts task-relevant signals from the visual and audio modalities to support language understanding. Concurrently, this architecture derives a representation of the conversational emotional atmosphere to refine multimodal integration, thereby mitigating the influence of redundant and conflicting information. Furthermore, we propose Variational Distillation Fusion, in which multimodal representations are probabilistically encoded as variational Gaussian distributions rather than deterministic embeddings, and the importance of each modality is then estimated automatically from the differences between these distributions. Experiments on the IEMOCAP and MELD datasets show that our proposed model outperforms previous state-of-the-art baseline models.
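To make the fusion idea concrete, below is a minimal PyTorch sketch of a variational fusion step in the spirit of the abstract's Variational Distillation Fusion: each modality representation is encoded as a Gaussian (mean and log-variance), and fusion weights are derived from differences between the resulting distributions. All module names, the choice of prior, and the exact weighting rule are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: variational per-modality encoding with distribution-based
# modality weighting. Names and the weighting rule are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariationalFusion(nn.Module):
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One Gaussian encoder (mean and log-variance head) per modality.
        self.mu_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        self.logvar_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))

    @staticmethod
    def kl_to_standard_normal(mu, logvar):
        # KL( N(mu, sigma^2) || N(0, I) ), summed over feature dimensions.
        return 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1)

    def forward(self, modality_feats):
        # modality_feats: list of [batch, dim] tensors (e.g., language, audio, visual).
        mus, logvars, kls = [], [], []
        for x, mu_head, lv_head in zip(modality_feats, self.mu_heads, self.logvar_heads):
            mu, logvar = mu_head(x), lv_head(x)
            mus.append(mu)
            logvars.append(logvar)
            kls.append(self.kl_to_standard_normal(mu, logvar))

        # Illustrative weighting: modalities whose distributions deviate less
        # from the prior receive larger fusion weights.
        kl_stack = torch.stack(kls, dim=-1)      # [batch, num_modalities]
        weights = F.softmax(-kl_stack, dim=-1)   # [batch, num_modalities]

        # Reparameterized samples, combined with the estimated weights.
        samples = [mu + torch.randn_like(mu) * (0.5 * lv).exp()
                   for mu, lv in zip(mus, logvars)]
        fused = sum(w.unsqueeze(-1) * s for w, s in zip(weights.unbind(-1), samples))
        return fused, weights
```

In this sketch the softmax over negative KL divergences plays the role of the automatic modality-importance estimate described in the abstract; the actual LFVD model may use a different reference distribution or distance measure.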