Generalizing to Unseen Speakers: Multimodal Emotion Recognition in Conversations With Speaker Generalization

Published: 2025 · Last Modified: 07 Jan 2026 · IEEE Trans. Affect. Comput. 2025 · CC BY-SA 4.0
Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to identify the emotion expressed in each utterance of a conversational video. Current efforts focus on modeling speaker-sensitive context dependencies and on multimodal fusion, yet they still struggle with utterances from unseen speakers, which hampers generalizability. To tackle this challenge, we propose a Speaker Generalization Framework for MERC. Specifically, we build a prototype graph to learn Speaker-based Utterance Representations (SUR), using prototypes as a bridge between seen and unseen speakers. Speaker-aware Contrastive Learning (CL) then refines the SUR, pulling utterances (or prototypes) from the same speaker together while pushing those from different speakers apart. Further, we introduce a prototypical graph CL to generalize the SUR to unseen speakers, encouraging utterances from the same speaker to exhibit similar graph structures and those from different speakers to differ. To further strengthen generalization, we introduce Uncertainty-based Generalization for Speakers, which randomly samples SUR statistics from an estimated Gaussian distribution and probabilistically replaces the original ones. Experiments on two datasets show that our framework substantially improves the generalization of various MERC models, surpassing state-of-the-art methods.
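The speaker-aware contrastive step can be read as a supervised contrastive loss with speaker identity as the label: same-speaker utterances are positives, all others negatives. Below is a minimal PyTorch sketch under that reading; the function name `speaker_contrastive_loss` and the `temperature` default are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def speaker_contrastive_loss(reps: torch.Tensor,
                             speaker_ids: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss with speaker identity as the label.

    reps:        (N, d) utterance representations (SUR)
    speaker_ids: (N,)   integer speaker labels
    """
    z = F.normalize(reps, dim=1)                       # compare in cosine space
    sim = z @ z.t() / temperature                      # (N, N) similarity logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))    # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # mean log-probability of same-speaker pairs, per anchor
    pos_log_prob = torch.where(pos_mask, log_prob,
                               torch.zeros_like(log_prob)).sum(dim=1)
    pos_counts = pos_mask.sum(dim=1)
    has_pos = pos_counts > 0                           # anchors with >= 1 positive
    return -(pos_log_prob[has_pos] / pos_counts[has_pos]).mean()
```

Minimizing this loss pulls same-speaker SURs together and pushes different-speaker SURs apart, matching the behavior the abstract describes.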
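The uncertainty-based generalization step suggests a statistic-resampling scheme: estimate each SUR's mean and standard deviation, model the uncertainty of those statistics across the batch as a Gaussian, and probabilistically swap the originals for sampled ones. The sketch below is one plausible instantiation under those assumptions; `uncertainty_perturb`, the replacement probability `p`, and the per-utterance statistic axes are hypothetical choices, not the paper's exact formulation.

```python
import torch

def uncertainty_perturb(sur: torch.Tensor,
                        p: float = 0.5,
                        eps: float = 1e-6) -> torch.Tensor:
    """With probability p, replace each utterance's SUR statistics with
    statistics sampled from Gaussians estimated over the batch.

    sur: (N, d) speaker-based utterance representations
    """
    if torch.rand(1).item() >= p:
        return sur                                    # keep the original SUR

    mu = sur.mean(dim=1, keepdim=True)                # (N, 1) per-utterance mean
    sigma = sur.std(dim=1, keepdim=True) + eps        # (N, 1) per-utterance std

    # uncertainty of the statistics themselves, estimated across the batch
    mu_std = mu.std(dim=0, keepdim=True)
    sigma_std = sigma.std(dim=0, keepdim=True)

    # draw new statistics from Gaussians centered at the originals
    new_mu = mu + torch.randn_like(mu) * mu_std
    new_sigma = sigma + torch.randn_like(sigma) * sigma_std

    # whiten with the original statistics, re-scale with the sampled ones
    return (sur - mu) / sigma * new_sigma + new_mu
```

A perturbation of this kind would apply only during training; at inference the SUR passes through unchanged, so unseen-speaker utterances are not perturbed.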