Speech Emotion Recognition Model Based on CRNN-CTC

Published: 2020 · Last Modified: 15 May 2024 · ATCI 2020 · CC BY-SA 4.0
Abstract: The CRNN (Convolutional Recurrent Neural Network) deep learning model is a typical speech emotion recognition technique. When this model is applied, a speech sequence of any length is mapped to a single emotion tag. However, the emotional information in speech samples is generally distributed unevenly across frames, which degrades the model's recognition performance. To address this problem, a speech emotion recognition model based on CRNN-CTC (Convolutional Recurrent Neural Network-Connectionist Temporal Classification) is proposed in this paper. Building on the CRNN model, the speech samples are first divided into emotional and non-emotional frames, and the CTC method is then used to make the network focus its learning on the emotional frames, avoiding the performance degradation caused by learning from non-emotional frames. Experimental results show that the model achieves a weighted average recall (WAR) of 70.11% and an unweighted average recall (UAR) of 69.53%. Compared with the CRNN model, speech emotion recognition performance is significantly improved.
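The frame-level idea behind the abstract can be illustrated with the standard CTC collapse rule, which maps a per-frame label sequence to an utterance-level output. This is a minimal sketch, not the paper's implementation; the label symbols ("-" for the CTC blank, "H"/"S" for emotion classes) are hypothetical.

```python
# Sketch of the standard CTC collapse rule. Here "-" plays the role of
# the CTC blank (a non-emotional frame), while other symbols stand for
# hypothetical emotion classes (e.g. "H" = happy, "S" = sad).

def ctc_collapse(frame_labels, blank="-"):
    """Merge consecutive repeated labels, then drop blanks (CTC decoding)."""
    collapsed = []
    prev = None
    for lab in frame_labels:
        if lab != prev:          # merge runs of identical labels
            collapsed.append(lab)
        prev = lab
    return [lab for lab in collapsed if lab != blank]

# An 8-frame utterance: the emotional frames ("H") are surrounded by
# non-emotional frames ("-"). CTC discards the blanks, so the training
# signal concentrates on the emotional frames, as the paper intends.
frames = ["-", "-", "H", "H", "H", "-", "-", "-"]
print(ctc_collapse(frames))  # -> ['H']
```

Because blanks carry no loss contribution for the output label, a CTC-trained network is free to emit blank on non-emotional frames instead of being forced to assign them an emotion tag.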