SERC-GCN: Speech Emotion Recognition In Conversation Using Graph Convolutional Networks

Published: 01 Jan 2024, Last Modified: 20 May 2025, Venue: ICASSP 2024, License: CC BY-SA 4.0
Abstract: Speech emotion recognition (SER) is the task of automatically recognizing emotions expressed in spoken language. Current approaches focus on analyzing isolated speech segments to identify a speaker’s emotional state. Meanwhile, recent text-based emotion recognition methods have effectively shifted towards emotion recognition in conversation (ERC) that considers conversational context. Motivated by this shift, here we propose SERC-GCN, a method for speech emotion recognition in conversation (SERC) that predicts a speaker’s emotional state by incorporating conversational context, speaker interactions, and temporal dependencies between utterances. SERC-GCN is a two-stage method. First, emotional features of utterance-level speech signals are extracted. Then, these features are used to form conversation graphs that are used to train a graph convolutional network to perform SERC. We empirically evaluate the effectiveness of SERC-GCN and show that it outperforms the current state-of-the-art methods on the IEMOCAP benchmark dataset.
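The second stage described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the conversation graphs are built or how the network is configured, so everything below is an assumption made for illustration: the fixed temporal window used to connect utterances, the random stand-in features, the layer sizes, and the function names `build_conversation_adjacency`, `normalize_adjacency`, and `gcn_layer` are all hypothetical. Speaker-dependent edges are omitted, and the weights are random rather than trained. The propagation rule is the standard symmetric-normalized graph convolution, which may differ from what SERC-GCN actually uses.

```python
import numpy as np


def build_conversation_adjacency(num_utterances, window=2):
    """Connect each utterance to those within `window` turns of it.

    The windowing rule is a hypothetical choice; self-loops are included
    because every utterance is at distance 0 from itself.
    """
    idx = np.arange(num_utterances)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(float)


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of an adjacency matrix."""
    deg = adj.sum(axis=1)  # at least 1 per node thanks to self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, features, weight):
    """One graph convolution with ReLU: max(A_hat @ X @ W, 0)."""
    return np.maximum(a_hat @ features @ weight, 0.0)


rng = np.random.default_rng(0)
num_utterances, feat_dim, hidden_dim, num_classes = 6, 16, 8, 4

# Stand-in for the utterance-level emotional features from stage one.
features = rng.normal(size=(num_utterances, feat_dim))

adjacency = build_conversation_adjacency(num_utterances, window=2)
a_hat = normalize_adjacency(adjacency)

# Two-layer network with untrained random weights, for shape illustration only.
hidden = gcn_layer(a_hat, features, rng.normal(size=(feat_dim, hidden_dim)))
logits = a_hat @ hidden @ rng.normal(size=(hidden_dim, num_classes))
predictions = logits.argmax(axis=1)  # one emotion class index per utterance
```

Each row of `logits` mixes information from an utterance and its neighbours in the conversation, which is the sense in which a graph-based model can use conversational context rather than treating each speech segment in isolation.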