Abstract: Emotional speech recognition is a challenging task for modern systems, as the presence of emotions significantly changes the characteristics of speech. In this paper, we propose a novel approach to emotional audio-visual speech recognition (EMO-AVSR). The proposed approach first uses visual speech data to detect a speaker's emotion and then processes the speech with one of several pre-trained emotional audio-visual speech recognition models. We implement these models as a combination of a spatio-temporal network for emotion recognition and cross-modal attention fusion for automatic audio-visual speech recognition. We present an experimental investigation of how emotion categories (happy, anger, disgust, fear, sad, and neutral), valence groups (positive, neutral, and negative), and a binary split (emotional vs. neutral) affect automatic audio-visual speech recognition. Evaluation on the CREMA-D corpus demonstrates up to a 7.3% absolute accuracy improvement over classical approaches.
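To make the emotion-conditioned routing idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class names, feature dimensions, and the specific convolutional backbone and attention fusion layer are all illustrative assumptions. It shows a visual emotion classifier selecting one of several emotion-specific AVSR models, each of which fuses audio and video features with cross-modal attention.

```python
import torch
import torch.nn as nn

# Emotion categories considered in the paper (CREMA-D labels).
EMOTIONS = ["happy", "anger", "disgust", "fear", "sad", "neutral"]

class VisualEmotionClassifier(nn.Module):
    """Toy spatio-temporal network over lip-region frames (B, C, T, H, W)."""
    def __init__(self, num_emotions=len(EMOTIONS)):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # pool over time and space
        )
        self.head = nn.Linear(16, num_emotions)

    def forward(self, frames):
        feats = self.backbone(frames).flatten(1)
        return self.head(feats)  # emotion logits

class EmotionSpecificAVSR(nn.Module):
    """AVSR model with a cross-modal attention fusion layer (illustrative)."""
    def __init__(self, dim=64, vocab_size=40):
        super().__init__()
        self.audio_proj = nn.Linear(80, dim)   # e.g. log-mel audio features
        self.video_proj = nn.Linear(128, dim)  # e.g. lip embeddings
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
        self.decoder = nn.Linear(dim, vocab_size)

    def forward(self, audio, video):
        a, v = self.audio_proj(audio), self.video_proj(video)
        # Audio frames attend to video frames (cross-modal fusion).
        fused, _ = self.cross_attn(query=a, key=v, value=v)
        return self.decoder(fused)  # per-frame token logits

def recognize(frames, audio, video, classifier, avsr_models):
    """Route the utterance to the AVSR model of the detected emotion."""
    emotion = EMOTIONS[classifier(frames).argmax(dim=-1).item()]
    return emotion, avsr_models[emotion](audio, video)

# Example with dummy tensors (batch=1, 16 frames of 96x96 lip crops).
classifier = VisualEmotionClassifier()
avsr_models = {e: EmotionSpecificAVSR() for e in EMOTIONS}
frames = torch.randn(1, 3, 16, 96, 96)
audio = torch.randn(1, 100, 80)   # 100 audio feature frames
video = torch.randn(1, 25, 128)   # 25 video feature frames
emotion, logits = recognize(frames, audio, video, classifier, avsr_models)
print(emotion, logits.shape)
```

In this sketch each per-emotion model would be pre-trained on speech of that emotion, so at inference time the classifier's prediction simply selects which set of weights processes the utterance.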