Affect-salient event sequence modelling for continuous speech emotion recognition

Yizhuo Dong, Xinyu Yang

2021 (modified: 13 Nov 2021), Neurocomputing 2021

Abstract: Continuous speech emotion recognition is a challenging problem in affective computing: it suffers from delay caused by annotators’ reaction time and from noise caused by non-emotional segments. To address these problems, we propose an affect-salient event sequence modelling (ASESM) method based on connectionist temporal classification (CTC). The method treats a sentence’s label as a chain of affect-salient event (ASE) states and non-affect-salient Null states rather than as a continuous emotional value. With this representation, a CTC-based convolutional neural network (CNN) is built to automatically label the sentence’s emotional segments as ASEs and its non-emotional segments as Null, reducing the impact of noise from non-emotional segments. Furthermore, we propose an event probability vector decoding (EPVD) algorithm that searches the CTC loss matrix for the optimal ASE sequence and marks the occurrence time of each event in that sequence. The arousal and valence ground-truth annotations of each ASE are then used to represent the continuous emotional value of any segment predicted as that ASE. Because the ground-truth annotations of each ASE already contain their own time delays, taking events as the target avoids the need for additional reaction-delay compensation. We test our method on the RECOLA and AVEC 2014 benchmark databases. The experimental results demonstrate that the proposed event-based method can improve the performance of continuous emotion recognition, and that the improvement is more pronounced when the selected ASEs have high annotation consistency.
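To make the event-sequence idea in the abstract concrete, the following is a minimal sketch of how frame-wise class probabilities might be turned into timed events and then into an arousal/valence trace. It is not the paper's EPVD algorithm, whose details the abstract does not give; it uses plain greedy (best-path) decoding instead. The function names, the choice of index 0 for Null, the treatment of Null as the CTC blank, and the per-event (arousal, valence) values are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

NULL = 0  # assumed index of the Null (non-affect-salient) state, also used as the blank

Event = Tuple[int, int, int]  # (event_id, start_frame, end_frame), end exclusive


def greedy_event_decode(probs: List[List[float]]) -> List[Event]:
    """Greedy best-path decode: take the argmax class per frame and
    merge consecutive identical non-Null frames into one timed event."""
    events: List[Event] = []
    prev, start = NULL, 0
    for t, frame in enumerate(probs):
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != prev:
            if prev != NULL:
                events.append((prev, start, t))  # close the running event
            prev, start = label, t
    if prev != NULL:
        events.append((prev, start, len(probs)))  # event runs to the last frame
    return events


def events_to_trace(
    events: List[Event],
    n_frames: int,
    event_values: Dict[int, Tuple[float, float]],
) -> List[Optional[Tuple[float, float]]]:
    """Assign each event's (arousal, valence) pair to the frames it covers.
    Frames decoded as Null stay None."""
    trace: List[Optional[Tuple[float, float]]] = [None] * n_frames
    for event_id, start, end in events:
        for t in range(start, end):
            trace[t] = event_values[event_id]
    return trace


if __name__ == "__main__":
    # Toy posteriors over {Null, ASE 1, ASE 2} for six frames (made-up numbers).
    probs = [
        [0.90, 0.05, 0.05],
        [0.20, 0.70, 0.10],
        [0.10, 0.80, 0.10],
        [0.80, 0.10, 0.10],
        [0.10, 0.20, 0.70],
        [0.60, 0.30, 0.10],
    ]
    # Hypothetical (arousal, valence) annotation attached to each ASE.
    event_values = {1: (0.6, 0.3), 2: (-0.4, -0.2)}
    events = greedy_event_decode(probs)
    print(events)
    print(events_to_trace(events, len(probs), event_values))
```

In this sketch the Null frames are left unset rather than interpolated; how such frames should be filled in a continuous prediction is a design choice the abstract does not specify.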
