Abstract: The majority of speech emotion recognition (SER) systems are developed using databases of simulated (acted) speech performed by professional actors, whereas in real-world deployment the inputs are mostly spontaneous utterances. Several SER studies have reported that the performance of models trained on acted emotional data degrades on spontaneous inputs. In this work, we improve SER performance under such elicitation-based data expression mismatch by employing multi-task learning (MTL) with data expression recognition as the auxiliary task. We use the ECAPA-TDNN architecture with MFCCs and wav2vec 2.0 pre-trained embeddings as features. We conduct this study on the IEMOCAP and BAUM-1 databases. The proposed MTL-based method achieves state-of-the-art performance on the SER task. Further, we conduct an emotion-specific analysis and show that the data expression knowledge mostly helps to classify highly aroused emotions.
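To make the multi-task setup concrete, below is a minimal PyTorch-style sketch of a shared encoder feeding an emotion head (main task) and a data-expression head (auxiliary task). The encoder placeholder, head dimensions, and the loss weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTLSER(nn.Module):
    """Multi-task SER model: a shared encoder with an emotion head (main task)
    and a data-expression head (acted vs. spontaneous, auxiliary task)."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_emotions: int):
        super().__init__()
        self.encoder = encoder  # placeholder for an ECAPA-TDNN-style backbone
        self.emotion_head = nn.Linear(embed_dim, num_emotions)
        self.expression_head = nn.Linear(embed_dim, 2)  # acted vs. spontaneous

    def forward(self, x):
        z = self.encoder(x)  # shared utterance-level embedding
        return self.emotion_head(z), self.expression_head(z)

def mtl_loss(emo_logits, expr_logits, emo_y, expr_y, alpha=0.3):
    # Joint objective: main SER loss plus a weighted auxiliary loss.
    # The weight alpha is a hypothetical hyperparameter, not from the paper.
    return F.cross_entropy(emo_logits, emo_y) + alpha * F.cross_entropy(expr_logits, expr_y)
```

The key design choice is that both heads share the encoder, so gradients from the auxiliary data-expression task shape an embedding space that is aware of the acted/spontaneous distinction.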