Abstract: Existing methods for emotion recognition in conversations (ERC) focus on exploiting information from different speakers' utterances to improve recognition performance, but they ignore the differential contributions of individual utterances to emotion recognition. In this paper, we propose a speaker-centric multimodal fusion network for ERC, in which bidirectional gated recurrent units (BiGRUs) perform intra-modal feature fusion and graph convolution performs speaker-centric cross-modal feature fusion. We construct a speaker-centric graph based on the differences between one speaker's utterances and those of the other speakers. This graph strengthens the network's focus on each speaker's own utterance information, effectively reducing interference from other speakers. Simultaneously, we employ an Utterance Distance Attention (UDA) module that tailors attention allocation to mitigate the influence of distant utterances on the current utterance. Experimental results on the IEMOCAP and MELD datasets demonstrate the effectiveness of our approach.
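To make the two components named in the abstract concrete, the sketch below shows one plausible way to build a speaker-centric adjacency matrix and a distance-aware attention weighting; the function names, the cross-speaker edge weight, and the linear distance penalty are illustrative assumptions rather than the authors' actual formulation.

```python
# Hypothetical sketch of a speaker-centric graph and distance-aware attention;
# all names and weights here are assumptions, not the paper's implementation.
import torch

def build_speaker_centric_adjacency(speaker_ids, window=4):
    """Connect utterance i to utterance j when both fall inside a local window,
    giving same-speaker edges full weight and cross-speaker edges a reduced
    weight (0.5 is an assumed value)."""
    n = len(speaker_ids)
    adj = torch.zeros(n, n)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            adj[i, j] = 1.0 if speaker_ids[i] == speaker_ids[j] else 0.5
    return adj

def utterance_distance_attention(scores, decay=0.1):
    """Penalize attention scores by inter-utterance distance before the softmax,
    so distant utterances influence the current utterance less."""
    n = scores.size(0)
    idx = torch.arange(n, dtype=torch.float)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return torch.softmax(scores - decay * dist, dim=-1)

# Toy dialogue with two speakers (A and B) and random utterance features.
speakers = ["A", "B", "A", "A", "B"]
features = torch.randn(5, 16)
adj = build_speaker_centric_adjacency(speakers)
attn = utterance_distance_attention(features @ features.t())
```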