A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

ACL ARR 2025 May Submission1208 Authors

16 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Emotion Recognition in Conversations (ERC) requires modeling the temporal context of multi-turn dialogues and the complementary information across modalities. We propose $\textbf{Mi}$xture of $\textbf{S}$peech-$\textbf{T}$ext $\textbf{E}$xperts for $\textbf{R}$ecognition of $\textbf{E}$motions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework that decouples modality-specific context modeling from multimodal integration. MiSTER-E builds on LLM-based representations for speech and text, models conversational context with a convolutional-recurrent layer, and integrates unimodal and cross-modal information through a gating mechanism. We introduce a supervised contrastive loss between aligned speech and text representations and a KL-divergence regularizer that encourages agreement across expert predictions. Notably, our method does not rely on speaker identity during training or inference. Experiments on two benchmark datasets, IEMOCAP and MELD, show that MiSTER-E achieves weighted F1-scores of 70.9% and 69.5%, respectively, outperforming prior speech-text ERC models. We further report ablation studies that isolate the contribution of each component.
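The abstract names three mechanisms: gated fusion of expert predictions, a supervised contrastive loss between aligned speech and text representations, and a KL-based agreement regularizer across experts. As a rough illustration only, the PyTorch sketch below shows one common realization of each; every module name, tensor shape, the temperature value, and the mean-of-experts form of the KL term are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, shapes, and loss forms are assumptions,
# not the MiSTER-E implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedExpertFusion(nn.Module):
    """Learned gate that mixes per-expert emotion logits (e.g. speech, text, cross-modal)."""

    def __init__(self, feat_dim: int, num_experts: int = 3):
        super().__init__()
        # Gate scores each expert from the concatenated expert features.
        self.gate = nn.Linear(feat_dim * num_experts, num_experts)

    def forward(self, feats, logits):
        # feats, logits: lists with one (B, feat_dim) / (B, C) tensor per expert.
        weights = F.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # (B, E)
        stacked = torch.stack(logits, dim=1)                              # (B, E, C)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)              # (B, C)
        return fused, stacked


def agreement_kl(stacked_logits):
    # KL term pulling each expert's distribution toward the experts' mean prediction.
    probs = F.softmax(stacked_logits, dim=-1)         # (B, E, C)
    mean = probs.mean(dim=1, keepdim=True)            # (B, 1, C)
    return F.kl_div(probs.clamp_min(1e-8).log(),      # input must be log-probs
                    mean.expand_as(probs),
                    reduction="batchmean")


def supervised_contrastive(speech, text, labels, tau: float = 0.07):
    # Speech anchors attract text embeddings of utterances sharing the emotion label.
    s, t = F.normalize(speech, dim=-1), F.normalize(text, dim=-1)
    sim = s @ t.T / tau                                     # (B, B) cosine / tau
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)    # log-softmax per anchor
    per_anchor = -(log_prob * pos).sum(-1) / pos.sum(-1).clamp_min(1.0)
    return per_anchor.mean()
```

A training objective would then combine cross-entropy on the fused logits with weighted versions of the two auxiliary terms; the specific weighting is likewise an assumption here.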
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: emotion detection and analysis, NLP tools for social analysis
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: emotion recognition in conversations, multimodal emotion recognition, LLM embeddings, mixture of experts
Submission Number: 1208