Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition

Published: 2021, Last Modified: 22 Jan 2026, Interspeech 2021, CC BY-SA 4.0
Abstract: As speech emotion recognition (SER) has developed, dialogue-level SER (DSER) has emerged as the setting more closely aligned with real-world scenarios. In this paper, we propose a DSER approach that includes two stages of representation learning: intra-utterance representation learning and inter-utterance representation learning. In the intra-utterance stage, the traditional convolutional neural network (CNN) has demonstrated great success. However, the basic design of a CNN restricts its ability to model both the local and the global information in the spectrogram. Therefore, we propose a novel local-global representation learning method for the intra-utterance stage. The local information is learned by a time-frequency convolutional neural network (TFCNN), which we published previously. To model the global information, we propose a time-frequency capsule neural network (TFCap), which can extract more stable global time-frequency information directly from spectrograms. In the inter-utterance stage, a graph convolutional network (GCN) is introduced to explore the relations between utterances in a dialogue. Our proposed methods were evaluated on the IEMOCAP database. The proposed time-frequency based method in the intra-utterance stage achieves an absolute increase of 9.35% compared to a CNN. By integrating the GCN in the inter-utterance stage, the proposed approach achieves a further absolute increase of 4.05% compared to the model from the intra-utterance stage.
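To make the inter-utterance stage more concrete, the sketch below shows one graph-convolution step over the utterances of a single dialogue. It is an illustration under stated assumptions and not the authors' implementation: the abstract does not say how the graph is built, so the sliding-window adjacency (`window_adjacency`), the feature sizes, and the random weights are all hypothetical. The propagation rule is the commonly used symmetric-normalized form, H = ReLU(D^{-1/2} A D^{-1/2} X W), where A includes self-loops.

```python
import numpy as np

def window_adjacency(num_utterances, window=2):
    # ASSUMPTION: connect each utterance to neighbours within `window` turns.
    # The diagonal is included, so every node already has a self-loop.
    idx = np.arange(num_utterances)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(float)

def normalize_adjacency(adj):
    # Symmetric normalization D^{-1/2} A D^{-1/2}; degrees are >= 1
    # because of the self-loops, so the division is safe.
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(features, adj_norm, weight):
    # One graph-convolution step: mix neighbouring utterance features,
    # project them with `weight`, then apply ReLU.
    return np.maximum(adj_norm @ features @ weight, 0.0)

# Hypothetical sizes: 6 utterances, 8-dim intra-utterance features, 4-dim output.
rng = np.random.default_rng(0)
num_utterances, in_dim, out_dim = 6, 8, 4
utterance_features = rng.standard_normal((num_utterances, in_dim))
weight = rng.standard_normal((in_dim, out_dim)) * 0.1

adj_norm = normalize_adjacency(window_adjacency(num_utterances, window=2))
dialogue_features = gcn_layer(utterance_features, adj_norm, weight)
print(dialogue_features.shape)
```

In this sketch, each row of `utterance_features` stands in for the output of the intra-utterance stage, and each row of `dialogue_features` is the same utterance after it has been mixed with its neighbours in the dialogue. A classifier over emotion labels would then operate on those context-aware rows.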