Dimensional Emotion Recognition from Speech Using Modulation Spectral Features and Recurrent Neural Networks

Published: 01 Jan 2019 · Last Modified: 12 Feb 2025 · APSIPA 2019 · CC BY-SA 4.0
Abstract: Dimensional emotion recognition (DER) from speech tracks the time-varying dynamics of emotions, enabling robots to interact naturally with humans. A DER system must obtain frame-level feature sequences by selecting appropriate acoustic features and an appropriate analysis duration, and these sequences should reflect the dynamic characteristics of the utterance. Temporal modulation cues are well suited to capturing such dynamic characteristics in speech perception and understanding. In this paper, we propose a DER system using modulation spectral features (MSFs) and recurrent neural networks (RNNs). The MSFs are derived from temporal modulation cues produced by an auditory front-end that applies auditory filtering to the speech signal and modulation filtering to its temporal envelope in cascade. The MSFs are then fed into RNNs to capture the dynamic change of emotions across the sequence. Experiments on predicting valence and arousal with the RECOLA database demonstrate that the proposed system significantly outperforms the baseline systems, improving arousal prediction by 17% and valence prediction by 29.5%.
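As a rough illustration of the cascade the abstract describes, the sketch below extracts MSF-style features with a band-pass auditory filterbank, Hilbert-envelope extraction, and a modulation filterbank, then feeds the frame-level feature sequence to an LSTM regressor. All filter counts, band edges, frame lengths, and network sizes here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert
import torch
import torch.nn as nn


def modulation_spectral_features(x, fs, n_aud=8, n_mod=4):
    """Cascade from the abstract: auditory band-pass filtering ->
    temporal envelope -> modulation band-pass filtering -> band energies."""
    # Log-spaced auditory bands (a stand-in for a gammatone filterbank).
    aud_edges = np.logspace(np.log10(100.0),
                            np.log10(min(8000.0, fs / 2 - 1)), n_aud + 1)
    # Log-spaced modulation bands covering ~2-32 Hz envelope fluctuations.
    mod_edges = np.logspace(np.log10(2.0), np.log10(32.0), n_mod + 1)
    feats = []
    for lo, hi in zip(aud_edges[:-1], aud_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                   # temporal envelope
        for mlo, mhi in zip(mod_edges[:-1], mod_edges[1:]):
            msos = butter(2, [mlo, mhi], btype="bandpass", fs=fs, output="sos")
            mod_env = sosfiltfilt(msos, env)          # modulation-filtered envelope
            feats.append(np.sqrt(np.mean(mod_env ** 2)))  # RMS energy of the band
    return np.asarray(feats)                          # n_aud * n_mod values


class DERNet(nn.Module):
    """LSTM regressor mapping a frame-level MSF sequence to per-frame
    (valence, arousal) predictions; sizes are illustrative."""

    def __init__(self, n_feats, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, seq):                           # seq: (batch, time, n_feats)
        out, _ = self.rnn(seq)
        return self.head(out)                         # (batch, time, 2)


# Usage sketch: frame the speech, extract MSFs per frame, predict a sequence.
fs = 16000
speech = np.random.randn(3 * fs)                      # stand-in for 3 s of audio
frame, hop = fs, fs // 2                              # 1 s frames, 50% overlap (assumed)
frames = [speech[i:i + frame] for i in range(0, len(speech) - frame + 1, hop)]
seq = np.stack([modulation_spectral_features(f, fs) for f in frames])
model = DERNet(seq.shape[1])
preds = model(torch.tensor(seq[None], dtype=torch.float32))
print(preds.shape)                                    # torch.Size([1, 5, 2])
```

The per-frame MSF vector summarizes how strongly each acoustic band's envelope fluctuates in each modulation band, and the recurrent layer models how those summaries evolve over the utterance, matching the sequence-modeling role the abstract assigns to the RNN.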