Multi-Granularity Temporal-Spectral Representation Learning for Speech Emotion Recognition

Published: 01 Jan 2024, Last Modified: 20 May 2025, SMC 2024, CC BY-SA 4.0
Abstract: Speech emotion recognition (SER) captures emotional information from speech signals to recognize users' emotional states, which plays a crucial role in conversational human-computer interaction. Most SER research focuses on exploiting emotional information from global temporal or spectral features, which may neglect detailed emotion-related information carried by phonemes and syllables. To address this problem, this paper proposes a multi-granularity temporal-spectral representation learning (MG-TSRL) network for speech emotion recognition. Specifically, MG-TSRL extracts temporal features at phonetic, syllabic, and sentential granularity from spectrograms to retain more detailed emotion-related information. It then applies multilayer emotion-aware units to capture emotion-related frequency patterns and obtain deep spectral features at each temporal granularity. MG-TSRL further introduces a fast broad learning system and feeds the deep temporal-spectral features into it to produce more accurate emotion predictions. MG-TSRL thus achieves effective temporal-spectral representation learning through multi-granularity temporal features and multilayer frequency-pattern learning. It achieves state-of-the-art unweighted accuracies of 95.17%, 92.78%, and 87.50% on the CASIA, RAVDESS, and SAVEE datasets, respectively, demonstrating the effectiveness of MG-TSRL in speech emotion recognition.
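The abstract does not specify how the three temporal granularities are computed; one simple reading is that frame-level spectrogram features are pooled over windows of increasing length. The sketch below illustrates that idea only. The window sizes, the mean-pooling operator, and the function name are all hypothetical stand-ins, not the paper's actual method:

```python
import numpy as np

def pool_granularity(spectrogram, window_frames):
    """Average-pool a (time, freq) spectrogram over non-overlapping
    windows of `window_frames` frames, yielding one coarser feature
    vector per window. Trailing frames that do not fill a window
    are dropped for simplicity."""
    t, f = spectrogram.shape
    n = t // window_frames
    trimmed = spectrogram[: n * window_frames]
    return trimmed.reshape(n, window_frames, f).mean(axis=1)

# Toy 100-frame, 40-bin log-mel-style spectrogram.
spec = np.random.default_rng(0).normal(size=(100, 40))

# Hypothetical window sizes standing in for the three granularities.
phoneme  = pool_granularity(spec, 5)        # ~phoneme scale  -> (20, 40)
syllable = pool_granularity(spec, 20)       # ~syllable scale -> (5, 40)
sentence = spec.mean(axis=0, keepdims=True) # whole utterance -> (1, 40)
```

In the paper each granularity's features would then pass through the emotion-aware units before classification; here the pooling only shows how one spectrogram can yield three temporal views of different resolution.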