Disentanglement Network: Disentangle the Emotional Features from Acoustic Features for Speech Emotion Recognition

Published: 01 Jan 2024 · Last Modified: 20 May 2025 · ICASSP 2024 · CC BY-SA 4.0
Abstract: Speech emotion recognition plays a crucial role in human-computer interaction. However, the distribution of speech signals varies across individuals, which may lead models to focus on identity information rather than emotional information and thus impairs their generalization ability. To address this issue, this paper proposes a novel Disentanglement Network (DTNet) to disentangle emotional features from acoustic features. Specifically, DTNet first captures hidden identity features from acoustic features through an identity-aware module. A disentanglement module then separates emotional features from acoustic features under the constraints of a reconstruction module and the hidden identity features. Together, these modules enable DTNet to extract more discriminative emotional features for emotion recognition. Experimental results under both speaker-independent and speaker-dependent settings demonstrate the effectiveness of DTNet: it achieves an unweighted accuracy (UA) of 74.8% on the IEMOCAP dataset and a UA of 95.5% on the Emo-DB dataset, outperforming state-of-the-art methods on both datasets.
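To make the described architecture concrete, below is a minimal sketch of the disentanglement idea from the abstract. All layer shapes, module internals, class names (`DTNetSketch`, `dtnet_loss`), and loss weights are illustrative assumptions; the abstract only states that an identity-aware module, a disentanglement module, and a reconstruction module are combined, without specifying their implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DTNetSketch(nn.Module):
    """Hypothetical sketch of the DTNet structure described in the abstract.

    Two encoders split the acoustic features into an identity part and an
    emotion part; a decoder reconstructs the input from both parts, which
    constrains the split so that emotional information is not discarded.
    """

    def __init__(self, acoustic_dim=80, hidden_dim=128,
                 num_emotions=4, num_speakers=10):
        super().__init__()
        # Identity-aware module: captures hidden identity features.
        self.identity_encoder = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Disentanglement module: extracts emotional features.
        self.emotion_encoder = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Reconstruction module: rebuilds acoustic features from both parts.
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, acoustic_dim),
        )
        self.emotion_head = nn.Linear(hidden_dim, num_emotions)
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)

    def forward(self, x):
        z_id = self.identity_encoder(x)
        z_emo = self.emotion_encoder(x)
        x_rec = self.decoder(torch.cat([z_id, z_emo], dim=-1))
        return z_id, z_emo, x_rec, self.emotion_head(z_emo), self.speaker_head(z_id)


def dtnet_loss(x, y_emotion, y_speaker, model, w_rec=1.0, w_id=0.5):
    """Combined objective: emotion classification on the emotion branch,
    identity supervision on the identity branch, and reconstruction of the
    acoustic input. The weights w_rec and w_id are placeholders, not values
    reported in the paper."""
    z_id, z_emo, x_rec, emo_logits, spk_logits = model(x)
    loss_emo = F.cross_entropy(emo_logits, y_emotion)
    loss_id = F.cross_entropy(spk_logits, y_speaker)
    loss_rec = F.mse_loss(x_rec, x)
    return loss_emo + w_id * loss_id + w_rec * loss_rec
```

In this reading, the identity supervision pulls speaker-specific variation into `z_id`, while the reconstruction constraint keeps the two branches jointly sufficient for the input, so the emotion classifier trains on features with less identity leakage. The actual DTNet may differ substantially in how the disentanglement constraint is enforced.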