Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems

Anonymous

16 Jun 2021 (modified: 05 May 2023), ACL ARR 2021 Jun Blind Submission, Readers: Everyone
Abstract: Emotion recognition models are a key component of several downstream applications, such as mental health assessment. These models are usually trained on small, clean, and synthetically controlled datasets, which leads to high failure rates in the presence of 'unseen' background noise and invites noise-overlay-based adversarial attacks. Noisy data augmentation has improved the robustness of speech recognition and classification models, where the ground-truth label remains consistent even in the presence of noise; this is not always true for subjectively perceived emotion labels. In this work, we create realistic noisy samples of IEMOCAP using multiple categories of environmental and synthetic noise. We evaluate how ground-truth labels (human) and predicted labels (model) change as these noise sources are introduced. We show that some commonly used noisy augmentation techniques affect human perception of emotion, thereby falsifying the 'clean' ground-truth label. Our experiments show that the performance of both baseline and denoised emotion recognition models declines significantly on noisy samples compared with the clean set. This degradation persists when the model is trained on a combination of clean samples and noisy samples that are mismatched with the test set. We investigate how using these 'human-perceptible' noise overlays can lead to inaccurate metrics when testing a model for robustness or vulnerability to adversarial attacks. Finally, we present a set of recommendations for noise-based augmentation of speech emotion datasets and for deploying models trained on those datasets.
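
The abstract does not state how noise is overlaid on the IEMOCAP utterances. As a purely illustrative sketch, the snippet below shows one common way to overlay a noise clip on a speech clip at a chosen signal-to-noise ratio (SNR). Mixing by SNR, the function names `rms` and `overlay_noise`, and the toy signals are all assumptions made here and are not taken from the submission.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def overlay_noise(speech, noise, snr_db):
    """Overlay `noise` on `speech` at the requested SNR in dB (hypothetical helper).

    The noise clip is looped to the speech length, then scaled so that
    20 * log10(rms(speech) / rms(scaled_noise)) equals `snr_db`.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    noise_rms = rms(tiled)
    if noise_rms == 0:
        raise ValueError("noise clip is silent")
    # Noise RMS that yields the requested SNR for this utterance.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, tiled)]


# Toy signals (assumed): a 440 Hz tone standing in for speech, Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
noisy = overlay_noise(speech, noise, snr_db=10)
```

A lower `snr_db` produces a louder overlay. Following the abstract's argument, an overlay that is loud enough to change how listeners perceive the emotion may no longer be consistent with the original clean label, so the SNR should not be chosen for model robustness alone.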