Emotion Recognition from Raw Speech Signals Using 2D CNN with Deep Metric Learning

Bogdan Mocanu, Ruxandra Tapu

Published: 2022, Last Modified: 01 Mar 2026, ICCE 2022, CC BY-SA 4.0

Abstract: In this paper we introduce a novel framework for emotion recognition from raw speech signals. The system is based on a ResNet architecture fed with spectrogram inputs. The CNN is further extended with a GhostVLAD feature aggregation layer that extracts a single, fixed-size descriptor at the utterance level. The system adopts a sentiment metric loss that integrates the relations between the various classes of emotions. The experimental evaluation, conducted on two publicly available databases, RAVDESS and CREMA-D, validates the proposed methodology with average accuracy scores of 82% and 63%, respectively.
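The GhostVLAD aggregation step mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `ghostvlad`, its parameters, and the convention that ghost clusters come last are assumptions made for illustration, and in a trained network the centers, weights and biases would be learned rather than passed in by hand. The sketch only shows how a variable number of frame-level features collapses into one fixed-size, utterance-level descriptor.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec, eps=1e-12):
    # Scale a vector to unit Euclidean length (eps avoids division by zero).
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def ghostvlad(features, centers, weights, biases, num_ghost):
    """Aggregate N frame-level features (N x D) into one K*D descriptor.

    centers and weights have shape (K + G) x D and biases has length K + G.
    Assumption for this sketch: the G ghost clusters are the LAST G entries.
    """
    num_real = len(centers) - num_ghost
    dim = len(features[0])
    vlad = [[0.0] * dim for _ in range(num_real)]
    for x in features:
        # Soft-assign the frame to all clusters, real and ghost.
        scores = [sum(wj * xj for wj, xj in zip(w, x)) + b
                  for w, b in zip(weights, biases)]
        assign = softmax(scores)
        # Accumulate weighted residuals for the real clusters only;
        # mass assigned to ghost clusters is simply discarded.
        for k in range(num_real):
            for d in range(dim):
                vlad[k][d] += assign[k] * (x[d] - centers[k][d])
    # Intra-normalize each cluster's residual, flatten, then normalize again.
    flat = [v for row in vlad for v in l2_normalize(row)]
    return l2_normalize(flat)

# Hypothetical toy parameters: K = 2 real clusters, G = 1 ghost cluster, D = 2.
centers = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

short_utt = [[0.5, 0.2], [0.9, 1.1], [0.1, 0.4]]
long_utt = short_utt + [[0.3, 0.8], [1.2, 0.7]]

d_short = ghostvlad(short_utt, centers, weights, biases, num_ghost=1)
d_long = ghostvlad(long_utt, centers, weights, biases, num_ghost=1)
print(len(d_short), len(d_long))
```

Both utterances yield a descriptor of length K * D regardless of how many frames they contain, which is the property that lets a metric loss compare utterances of different durations.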

External IDs: dblp:conf/iccel/MocanuT22