- Keywords: self supervised learning, speech representations, speech embeddings, contrastive loss, emotion classification
- TL;DR: Learning speech representations in self supervised setting and their evaluation on emotion classification task in different languages.
- Abstract: In this paper, we propose a technique for learning speech representations or embeddings in a self supervised manner, and show their performance on emotion classification task. We also investigate the usefulness of these embeddings for languages different from the pretraining corpus. We employ a convolutional encoder model and contrastive loss function on augmented Log Mel spectrograms to learn meaningful representations from an unlabelled speech corpus. Emotion classification experiments are carried out on SAVEE corpus, German EmoDB, and CaFE corpus. We find that: (1) These pretrained embeddings perform better than MFCCs, openSMILE features and PASE+ encodings for emotion classification task. (2) These embeddings improve accuracies in emotion classification task on languages different from that used in pretraining thus confirming language agnostic behaviour.
- Double Submission: Yes