Abstract: In this paper we introduce a novel framework for emotion recognition from raw speech signals. The system is based on a ResNet architecture fed with spectrogram inputs. The CNN is extended with a GhostVLAD feature aggregation layer that produces a single, fixed-size descriptor at the utterance level. The system adopts a sentiment metric loss that integrates the relations between the various classes of emotions. The experimental evaluation, conducted on two publicly available databases, RAVDESS and CREMA-D, validates the proposed methodology, with average accuracy scores of 82% and 63%, respectively.
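To illustrate how an aggregation layer of this kind can turn a variable number of frame-level CNN features into one fixed-size utterance descriptor, the following is a minimal NumPy sketch of GhostVLAD-style pooling. It is not the authors' implementation: the function name `ghostvlad`, its parameter names, and the choice of NumPy are assumptions made for illustration, and in a trained network the centers, weights, and biases would be learned rather than passed in.

```python
import numpy as np


def ghostvlad(features, centers, weights, biases, num_ghost):
    """Aggregate frame-level features into one fixed-size descriptor.

    Hypothetical helper for illustration only.
    features: (N, D) local descriptors, N may vary per utterance
    centers:  (K + G, D) cluster centers, the last G being ghost clusters
    weights:  (D, K + G) and biases: (K + G,) soft-assignment parameters
    """
    # Soft-assign every feature to all K + G clusters (stable softmax).
    logits = features @ weights + biases
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Ghost clusters take part in the softmax but are dropped afterwards,
    # so features assigned mostly to them contribute little to the output.
    num_real = centers.shape[0] - num_ghost
    assign = assign[:, :num_real]

    # Weighted sum of residuals to each real cluster center: shape (K, D).
    residuals = features[:, None, :] - centers[None, :num_real, :]
    vlad = (assign[:, :, None] * residuals).sum(axis=0)

    # Intra-normalize each cluster, flatten, then L2-normalize the whole vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    descriptor = vlad.reshape(-1)
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)
```

Under these assumptions the output length is K × D regardless of how many frames the utterance contains, which is what allows a single classifier or metric loss to operate on utterances of different durations.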
External IDs: dblp:conf/iccel/MocanuT22