Abstract: Label scarcity remains a significant challenge in speech emotion recognition (SER), often limiting the effectiveness of training models from scratch. Furthermore, speaker variability in acoustic representations hinders the generalization of emotion recognition systems. Prior research has demonstrated that mitigating speaker-related information can improve performance in SER tasks. In this work, we propose an efficient method to learn speaker-invariant representations by suppressing speaker identity from a pre-trained model (Wav2Vec2.0). Our approach enhances the robustness of emotion classification while addressing the challenges of limited labeled data and inter-speaker variability.
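The keywords mention a gradient reversal layer, the standard mechanism for adversarially suppressing an attribute (here, speaker identity) from learned representations. Below is a minimal, hedged sketch of how such a layer could be attached to encoder features; it is not the paper's implementation, and the class names (`GradReverse`, `SpeakerAdversarialHead`) and the PyTorch framing are illustrative assumptions.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) the gradient
    sign in the backward pass, so the upstream encoder is pushed to
    *remove* the information the attached classifier relies on."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient flowing back to the encoder.
        return -ctx.lambd * grad_output, None


class SpeakerAdversarialHead(nn.Module):
    """Hypothetical speaker classifier behind a gradient reversal layer.
    Training this head on speaker labels drives the shared features
    toward speaker invariance."""

    def __init__(self, dim: int, n_speakers: int, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.clf = nn.Linear(dim, n_speakers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.clf(GradReverse.apply(features, self.lambd))
```

In a full pipeline, `features` would be pooled Wav2Vec2.0 hidden states shared with the emotion classifier; the emotion loss and the (reversed) speaker loss are summed and backpropagated jointly.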
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: Sentiment Analysis, pre-trained model, gradient reversal layer
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: Speech Emotion Recognition
Submission Number: 6076