Abstract: The recent surge in deep learning has improved Speech Emotion Recognition (SER) model performance; however, ensuring robustness across diverse scenarios beyond the training dataset remains a challenge. This challenge is especially pronounced in real-world situations characterized by noisy conditions, where model robustness to corrupted audio is crucial. Despite ongoing efforts to develop noise-robust models, the lack of standardized evaluation protocols hampers fair comparisons among different models. This paper tackles this issue by introducing Robuser, a benchmarking procedure designed specifically for evaluating the robustness of SER models under noise. Robuser is a comprehensive open-source benchmark that can be applied to any speech dataset and covers diverse corruption types along two key dimensions: additive background noise and signal distortions, each at varying severity levels. Furthermore, by evaluating a state-of-the-art SER model against this benchmark, we offer quantitative insights into the impact of the different corruption types and severity levels on performance. The baseline model exhibits a performance degradation of up to 22.77% in Unweighted Accuracy (UA) and 20.32% in Weighted Accuracy (WA) on corrupted IEMOCAP, underscoring the substantial room for improvement in this domain. Our code is openly available at the following URL: https://github.com/BehavioralSignalTechnologies/ser_robustness.git
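To make the additive-noise dimension concrete, the following is a minimal sketch of injecting background noise into a speech waveform at a target signal-to-noise ratio (SNR), where lower SNR corresponds to higher corruption severity. This is an illustrative assumption, not the released Robuser implementation; the function name `add_noise_at_snr` and the NumPy-based mixing are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Hypothetical helper (not from the Robuser repo): mix background
    noise into a speech waveform at a target SNR in dB."""
    # Tile or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the mixture attains the requested SNR:
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise).
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example severity ladder: 20 dB (mild), 10 dB (moderate), 0 dB (severe).
```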