Abstract: Label scarcity remains a significant challenge in speech emotion recognition (SER), often limiting the effectiveness of training models from scratch. Furthermore, speaker variability in acoustic representations hinders the generalization of emotion recognition systems. Prior research has demonstrated that mitigating speaker-related information can improve performance in SER tasks. In this work, we propose an efficient method to learn speaker-invariant representations by suppressing speaker identity from a pre-trained model (Wav2Vec2.0). Our approach enhances the robustness of emotion classification while addressing the challenges of limited labeled data and inter-speaker variability.
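The keywords mention a gradient reversal layer, the standard mechanism for adversarially suppressing an attribute (here, speaker identity) from learned representations. Below is a minimal, hedged sketch of how such a layer could be attached to encoder features; it is not the paper's implementation, and the class names (`GradReverse`, `SpeakerAdversarialHead`) and the PyTorch framing are illustrative assumptions.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient
    sign in the backward pass, so the upstream encoder is pushed to
    *remove* the information the attached classifier relies on."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient flowing back to the encoder.
        return -ctx.lambd * grad_output, None


class SpeakerAdversarialHead(nn.Module):
    """Hypothetical speaker classifier behind a gradient reversal layer.
    Training this head on speaker labels drives the shared features
    toward speaker invariance."""

    def __init__(self, dim: int, n_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.clf = nn.Linear(dim, n_speakers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.clf(GradReverse.apply(features, self.lambd))
```

In a full pipeline, `features` would be pooled Wav2Vec2.0 hidden states shared with the emotion classifier; the emotion loss and the (reversed) speaker loss are summed and backpropagated jointly.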
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Sentiment Analysis, pre-trained model, gradient reversal layer
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: Speech Emotion Recognition
Submission Number: 6076