Robust Representation Learning for Speech Emotion Recognition with Moment Exchange

Published: 2023, Last Modified: 13 Nov 2024, APSIPA ASC 2023, CC BY-SA 4.0
Abstract: Speech emotion recognition (SER) is essential for machines to understand human intentions. Various deep neural network (DNN) based models have been proposed, but they still suffer from overfitting on small training datasets, especially because the identical-distribution assumption is usually violated in practical datasets. Data augmentation combined with per-instance normalization is a common approach to training models more stably; in this approach, the moments (i.e., the mean and variance) of the augmented data representation are removed as noise. To further improve robustness, this paper introduces Moment Exchange (MoEx), a label-perturbing data augmentation method that forces DNN models to pay attention to the differences in moments extracted from individual instances of each emotion class. Experiments on several classical model structures show promising performance improvements from an extremely simple yet effective implementation of MoEx.
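The moment-exchange idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say at which layer or over which axis the moments are computed, nor how the labels are weighted, so the flat feature vectors, the `EPS` constant, the `lam` mixing weight, and all function names below are assumptions made for illustration only. The sketch normalizes the features of one utterance with its own mean and standard deviation, re-scales them with the moments of a second utterance, and interpolates the two labels.

```python
import math
from statistics import fmean, pstdev

EPS = 1e-5  # assumed small constant for numerical stability


def moments(features):
    """Return the (mean, standard deviation) of one instance's features."""
    return fmean(features), math.sqrt(pstdev(features) ** 2 + EPS)


def moment_exchange(feat_a, feat_b):
    """Normalize feat_a with its own moments, then re-scale with feat_b's."""
    mean_a, std_a = moments(feat_a)
    mean_b, std_b = moments(feat_b)
    return [std_b * (x - mean_a) / std_a + mean_b for x in feat_a]


def mixed_label(label_a, label_b, lam):
    """Interpolate one-hot labels: weight lam for A, (1 - lam) for B."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


# Hypothetical hidden features of two utterances with different emotions.
feat_a = [0.2, 1.4, -0.6, 0.8]
feat_b = [3.0, 5.0, 4.0, 6.0]

# The result keeps the normalized shape of A but carries the moments of B.
mixed = moment_exchange(feat_a, feat_b)

# The training target reflects both sources (lam = 0.9 is an assumed value).
target = mixed_label([1.0, 0.0], [0.0, 1.0], lam=0.9)
```

Because the target assigns some weight to the class of the utterance that donated the moments, a model trained this way cannot treat the mean and variance as pure noise, which is the effect the abstract attributes to MoEx.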