Exploiting Annotators' Typed Description of Emotion Perception to Maximize Utilization of Ratings for Speech Emotion Recognition

Abstract: The decision of ground truth for speech emotion recognition (SER) is still a critical issue in affective computing tasks. Previous studies on emotion recognition often rely on consensus labels after aggregating the classes selected by multiple annotators. It is common for a perceptual evaluation conducted to annotate emotional corpora to include the class “other,” allowing the annotators the opportunity to describe the emotion with their own words. This practice provides valuable emotional information, which, however, is ignored in most emotion recognition studies. This paper utilizes easy-accessed natural language processing toolkits to mine the sentiment of these typed descriptions, enriching and maximizing the information obtained from the annotators. The polarity information is combined with primary and secondary annotations provided by individual evaluators under a label distribution framework, creating a complete representation of the emotional content of the spoken sentences. Finally, we train multitask learning SER models with existing learning methods (soft-label, multi-label, and distribution-label) to show the performance of the novel ground truth in the MSP-Podcast corpus.
0 Replies
Loading