Dataset Construction and Effectiveness Evaluation of Spoken-Emotion Recognition for Human Machine Interaction

Published: 01 Jan 2025, Last Modified: 10 Jun 2025 · IEEE Access 2025 · CC BY-SA 4.0
Abstract: The widespread use of large language models (LLMs) and voice-based agents has rapidly expanded Human-Computer Interaction (HCI) through spoken dialogue. To achieve more natural communication, nonverbal cues—especially those tied to emotional states—are critical and have been studied via deep learning. However, three key challenges persist in existing emotion recognition datasets: 1) most assume human-to-human interaction, neglecting shifts in speech patterns when users address a machine; 2) many include acted emotional expressions that differ from genuine internal states; and 3) even non-acted datasets often rely on third-party labels, creating potential mismatches with speakers’ actual emotions. Prior studies report that agreement between external labels and speakers’ internal states can be as low as 60–70%. To address these gaps, we present the VR-Self-Annotation Emotion Dataset (VSAED), consisting of 1,352 naturally induced and non-acted Japanese utterances (1.5 hours). Each utterance is labeled with self-reported internal emotional states spanning six categories. We investigated: 1) how effectively non-acted, machine-oriented speech conveys internal emotions, 2) whether speakers alter expressions when aware of an emotion recognition system, and 3) whether specific conditions yield notably high accuracy. In experiments using a HuBERT-based classifier, we achieved approximately 40% recognition accuracy, underscoring the complexity of capturing subtle internal emotions. These findings highlight the importance of domain-specific datasets for human-machine interactions.
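The abstract does not specify the classifier architecture beyond "HuBERT-based," so the following is only a minimal sketch of how such an utterance-level classifier is commonly built: a pretrained HuBERT encoder, mean pooling over frame features, and a linear head over the six emotion categories. The checkpoint name (`facebook/hubert-base-ls960`), pooling strategy, and use of the Hugging Face `transformers` and `torch` libraries are assumptions, not the authors' implementation.

```python
# Sketch of a HuBERT-based six-class emotion classifier (assumed design, not the paper's code).
import torch
import torch.nn as nn
from transformers import HubertModel, Wav2Vec2FeatureExtractor

class HubertEmotionClassifier(nn.Module):
    def __init__(self, pretrained_name: str = "facebook/hubert-base-ls960", num_classes: int = 6):
        super().__init__()
        # Pretrained speech encoder; checkpoint name is an assumption for illustration.
        self.encoder = HubertModel.from_pretrained(pretrained_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # Encode the raw waveform, then mean-pool frame features into one utterance vector.
        frames = self.encoder(input_values).last_hidden_state  # (batch, time, hidden)
        pooled = frames.mean(dim=1)
        return self.head(pooled)  # (batch, num_classes) emotion logits

if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
    waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    logits = HubertEmotionClassifier()(inputs.input_values)
    print(logits.shape)  # torch.Size([1, 6])
```

In practice the head would be fine-tuned (and optionally the encoder as well) with a cross-entropy loss over the dataset's six self-reported emotion labels; the pooling and fine-tuning choices here are illustrative defaults rather than details reported in the abstract.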