Abstract: Emotions are essential for human communication as they reflect our inner states and influence our actions. Today, emotions provide crucial information to many applications, from virtual assistants to security systems, mood-tracking wearable devices, and robots used in autism therapy. A speech emotion recognition (SER) model must therefore be lightweight enough to run on diverse devices with limited computational power. This research investigates the performance of music-related features for SER, motivated by auditory and neuropsychological evidence of the connection between emotional speech and music in human perception. Unlike prior work on low-level descriptors, which primarily focuses on characterizing human speech production, our method employs features extracted directly from raw speech signals through the Discrete Fourier Transform and the Constant-Q Transform. These features represent the perceptual pitch and timbre characteristics of the human voice. Ten-fold cross-validation results show that our method improves the accuracy of audio feature-based approaches on the RAVDESS, CREMA-D, and IEMOCAP datasets. Findings from the ablation study indicate the importance of perceptual pitch, perceptual loudness, and the combination of pitch and timbre features in building a robust SER model. Compared to pretrained deep learning embeddings, our method demonstrates strong generalizability and high efficiency with a much smaller model size.
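As a rough illustration of the kind of pipeline the abstract describes (not the authors' exact implementation), the sketch below derives pitch- and timbre-oriented features from a raw waveform using librosa's Constant-Q Transform and short-time DFT; the file name, frame parameters, and pooling scheme are assumptions for the example.

```python
import numpy as np
import librosa

# Hypothetical input clip; the 22.05 kHz sample rate is an assumption.
y, sr = librosa.load("speech_clip.wav", sr=22050)

# Constant-Q Transform: log-spaced frequency bins aligned with musical pitch,
# intended to capture the perceptual pitch structure of the voice.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12))

# Short-time DFT magnitude spectrum as a complementary timbre representation.
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

# Pool each representation over time (mean and std per bin) to obtain a
# fixed-length feature vector suitable for a lightweight classifier.
features = np.concatenate([
    cqt.mean(axis=1), cqt.std(axis=1),
    stft.mean(axis=1), stft.std(axis=1),
])
print(features.shape)
```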