Multi-scale Aggregation Network for Speech Emotion Recognition

Published: 01 Jan 2023, Last Modified: 23 Jun 2025 · CSoNet 2023 · CC BY-SA 4.0
Abstract: Speech emotion recognition (SER) is a challenging task because of the difficulty of finding effective representations of emotion in speech. Most conventional speech feature extraction methods are highly sensitive to emotionally irrelevant factors, such as the speaker, speaking style, and background noise, rather than capturing the underlying emotional nuances. Deep convolutional neural networks (CNNs), which can extract high-level features from low-level speech representations, are the most widely used feature extractors for SER. However, most CNN-based approaches rely on single-scale features from the final network layer, which often fail to adequately capture the diverse spectrum of emotional characteristics in speech. This paper introduces a multi-scale feature aggregation (MSA) network for SER, a fully convolutional network of the feature pyramid network (FPN) family. The network aggregates multi-scale features from different layers of the feature extractor via a top-down pathway and lateral connections, allowing it to capture a more comprehensive and nuanced representation of the emotional information embedded in speech. Additionally, to address the limited data and class imbalance inherent in speech emotion recognition, we adopt data augmentation techniques to generate supplementary training samples. Experiments on the interactive emotional dyadic motion capture (IEMOCAP) dataset demonstrate the efficacy of the proposed model, showing that it significantly improves SER performance.
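
The paper page does not include an implementation; the following PyTorch sketch only illustrates the general FPN-style aggregation described above: lateral 1x1 connections project CNN stage outputs to a common width, a top-down pathway upsamples and merges coarser maps into finer ones, and the merged scales are pooled for classification. The module name MSAHead, the channel widths, and the four-class head (matching the common four-emotion IEMOCAP setup) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MSAHead(nn.Module):
    """Illustrative FPN-style multi-scale aggregation (not the authors' code).

    Takes feature maps from three CNN stages (fine to coarse), builds a
    top-down pathway with lateral 1x1 connections, and pools the merged
    maps into a single emotion logit vector.
    """
    def __init__(self, in_channels=(128, 256, 512), fpn_channels=256, num_classes=4):
        super().__init__()
        # 1x1 lateral convolutions project each stage to a common width
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, fpn_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 smoothing convolutions applied after each top-down merge
        self.smooths = nn.ModuleList(
            nn.Conv2d(fpn_channels, fpn_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )
        self.classifier = nn.Linear(fpn_channels * len(in_channels), num_classes)

    def forward(self, feats):
        # feats: [c3, c4, c5], ordered fine (high-res) to coarse (low-res)
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample each coarser map and add it to the
        # next finer lateral projection
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        merged = [smooth(l) for smooth, l in zip(self.smooths, laterals)]
        # Global-average-pool each scale and concatenate before classifying
        pooled = [m.mean(dim=(-2, -1)) for m in merged]
        return self.classifier(torch.cat(pooled, dim=1))

# Example with dummy spectrogram-derived feature maps
if __name__ == "__main__":
    c3 = torch.randn(2, 128, 32, 32)
    c4 = torch.randn(2, 256, 16, 16)
    c5 = torch.randn(2, 512, 8, 8)
    logits = MSAHead()([c3, c4, c5])
    print(logits.shape)  # torch.Size([2, 4])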
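
The abstract does not specify which augmentation techniques were used. As a hedged illustration only, the sketch below implements two waveform-level augmentations common in the SER literature, additive noise at a target signal-to-noise ratio and random time shifting; the function names and default parameters are assumptions, not the paper's method.

import numpy as np

def add_noise(wave, snr_db=15.0, rng=None):
    """Mix white Gaussian noise into a waveform at a target SNR (in dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def time_shift(wave, max_shift=1600, rng=None):
    """Circularly shift the waveform by up to max_shift samples either way."""
    rng = rng or np.random.default_rng()
    shift = rng.integers(-max_shift, max_shift + 1)
    return np.roll(wave, shift)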