ERVQ: Leverage Residual Vector Quantization for Speech Emotion Recognition

Published: 01 Jan 2024, Last Modified: 15 May 2025 · ISCSLP 2024 · CC BY-SA 4.0
Abstract: Although speech pre-trained models (PTMs) have shown remarkable performance in speech emotion recognition (SER), they are built for general tasks and are limited in capturing emotion-related features. Recently, popular text-to-speech models have used residual vector quantization (RVQ) to effectively embed multi-scale speech detail, achieving high speech reconstruction quality. Inspired by this success, we explore RVQ's potential for emotional representation learning. In this paper, we present a novel perspective on SER by introducing an enhanced RVQ for emotion recognition, called Emotion RVQ (ERVQ). To strengthen ERVQ's ability to capture emotional features, we introduce modifications and apply content alignment to it. Experimental results show that our approach achieves state-of-the-art (SOTA) performance in both in-domain and out-of-domain SER scenarios, demonstrating that its emotional representations outperform those of other speech PTMs.
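For readers unfamiliar with the underlying mechanism, the following is a minimal sketch of residual vector quantization, not the paper's ERVQ implementation: each quantizer stage encodes the residual left by the previous stage, which is what lets RVQ capture progressively finer, multi-scale detail. The codebook sizes, dimensions, and function names here are illustrative assumptions.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization (illustrative sketch).

    Each stage quantizes the residual left over by the previous stage,
    so later codebooks capture progressively finer detail.

    x: (dim,) input vector; codebooks: list of (num_codes, dim) arrays.
    Returns the selected code indices and the reconstructed vector.
    """
    residual = x.copy()
    indices, reconstruction = [], np.zeros_like(x)
    for codebook in codebooks:
        # Pick the code nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        reconstruction += codebook[idx]
        residual = residual - codebook[idx]  # pass the remainder to the next stage
    return indices, reconstruction

# Toy usage: 3 quantization stages over 8-dim vectors (hypothetical sizes).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]
x = rng.normal(size=8)
indices, x_hat = rvq_encode(x, codebooks)
print(indices, np.linalg.norm(x - x_hat))  # reconstruction error shrinks with more stages
```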