CONTRASTIVE UNSUPERVISED LEARNING FOR SPEECH EMOTION RECOGNITION
Abstract: Speech emotion recognition (SER) is a key technology to enable more natural human-machine communication. However, SER has long suffered from a lack of public large-scale labeled datasets. To circumvent this problem, we investigate how unsupervised representation learning on unlabeled datasets can benefit SER. We show that the contrastive predictive coding (CPC) method can learn salient representations from unlabeled datasets, which improves emotion recognition performance. In our experiments, this method achieved state-of-the-art concordance correlation coefficient (CCC) performance for all emotion primitives (activation, valence, and dominance) on IEMOCAP. Additionally, on the MSP-Podcast dataset, our method obtained considerable performance improvements compared to baselines.
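As background, the CPC pre-training named in the abstract is built on the InfoNCE contrastive objective: a context vector summarizing past frames is used to score a true future latent against negatives drawn from other times or utterances. The sketch below is a minimal, illustrative NumPy version of that loss for a single prediction step; the dimensions, the bilinear scoring matrix, and the sampling of negatives are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def info_nce_loss(context, positive, negatives, W):
    """InfoNCE loss for one CPC prediction step (illustrative sketch).

    context:   (d_c,) context vector c_t from the autoregressive model
    positive:  (d_z,) the true future latent z_{t+k}
    negatives: (n, d_z) latents sampled from other times/utterances
    W:         (d_z, d_c) step-specific prediction matrix W_k
    """
    pred = W @ context                    # predicted future latent W_k c_t
    pos_score = positive @ pred           # similarity to the true future
    neg_scores = negatives @ pred         # similarities to the negatives
    logits = np.concatenate([[pos_score], neg_scores])
    logits -= logits.max()                # numerical stability
    # Loss is -log softmax probability assigned to the positive sample
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy usage with random vectors (hypothetical sizes)
rng = np.random.default_rng(0)
d_c, d_z, n_neg = 8, 8, 16
c = rng.normal(size=d_c)
W = rng.normal(size=(d_z, d_c))
z_pos = rng.normal(size=d_z)
z_negs = rng.normal(size=(n_neg, d_z))
loss = info_nce_loss(c, z_pos, z_negs, W)
```

Minimizing this loss over many steps and utterances pushes the learned representations to carry information that is predictive of the future signal, which is what makes them useful for downstream SER fine-tuning.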