Speech Emotion Classification with Parallel Architecture of Deep Learning and Multi-Head Attention Transformer

Published: 2023 (SICE 2023) · Last Modified: 29 Jul 2025 · License: CC BY-SA 4.0
Abstract: Speech is the most direct and efficient method of human communication, and it carries rich information about the speaker's emotional state. The ability to recognize and distinguish the emotions conveyed in spoken sentences is a necessary component of intelligent human-computer interaction (HCI) applications. To enable a more natural and intuitive way for humans to communicate with automation and control systems, emotional expressions conveyed through speech signals must be recognized and processed accordingly. In this paper, the authors propose a parallel architecture that combines a deep-learning SENet, a CNN block, and a Transformer with multi-head attention to effectively distinguish the features of different emotional states in users' voice recordings. Speech samples from the open-source RAVDESS dataset were used to assess the performance of the model during training. The proposed model achieved a highest average accuracy of 82.67% on the test set.
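The abstract describes a parallel design in which convolutional and attention branches process the same acoustic features before fusion. The sketch below is a minimal, illustrative NumPy mock-up of that idea, not the authors' implementation: the multi-head self-attention layer uses random (untrained) projection weights, the "CNN-like" branch is reduced to a ReLU-and-pool stand-in, and the input is random data standing in for frame-level features (e.g. MFCCs) with 8 output classes matching the RAVDESS emotion labels.

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_head_attention(x, num_heads, rng):
    """Scaled dot-product multi-head self-attention over a (frames, dim) matrix.

    Projection weights are random here, standing in for learned parameters."""
    T, d = x.shape
    assert d % num_heads == 0
    dh = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)          # (T, T) attention logits
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                  # row-wise softmax
        heads.append(w @ V[:, s])                           # (T, dh) per head
    return np.concatenate(heads, axis=-1) @ Wo              # (T, d)

# Hypothetical dimensions: 100 frames, 64-dim features, 8 RAVDESS emotion classes.
T, d, n_classes = 100, 64, 8
x = rng.standard_normal((T, d))                             # stand-in for MFCC-like features

# Parallel branches processed independently, then fused by concatenation.
att_branch = multi_head_attention(x, num_heads=4, rng=rng).mean(axis=0)  # (d,)
cnn_branch = np.maximum(x, 0).mean(axis=0)                  # (d,) crude ReLU+pool stand-in
fused = np.concatenate([att_branch, cnn_branch])            # (2d,)

# Linear classifier head over the fused representation.
W_cls = rng.standard_normal((2 * d, n_classes)) / np.sqrt(2 * d)
logits = fused @ W_cls
print(logits.shape)
```

The point of the parallel layout is that the attention branch can capture long-range temporal dependencies across frames while the convolutional branch captures local spectral patterns; concatenating the two pooled representations lets the classifier draw on both.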