Keywords: speech emotion recognition, automatic speech recognition, large language model
TL;DR: This paper tackles Speech Emotion Recognition challenges by combining conversational context with transcripts from multiple ASR models and LLMs. The approach outperforms state-of-the-art methods on the IEMOCAP and MELD datasets.
Abstract: Speech Emotion Recognition (SER) is the task of automatically identifying emotions expressed in spoken language. With the rise of large language models (LLMs), many studies have applied them to SER, but several key challenges remain. Current approaches often focus on isolated utterances, overlooking the rich contextual information present in conversations and the dynamic nature of emotions. Additionally, most methods rely on transcripts from a single Automatic Speech Recognition (ASR) model, neglecting the variability in word error rates (WER) across different ASR systems. Furthermore, the optimal length of conversational context and the impact of prompt structure on SER performance have not been sufficiently explored. To tackle these challenges, we design models that take ASR transcripts from multiple sources as input and integrate custom prompts with varying conversational context window lengths. Empirical evaluations demonstrate that our method outperforms state-of-the-art techniques on the IEMOCAP and MELD datasets, highlighting the importance of conversational context and ASR transcript diversity in SER tasks. All code from our experiments is publicly available.
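To make the prompt-construction idea concrete, the sketch below shows one plausible way to combine a conversational context window with transcripts of the target utterance from multiple ASR systems before querying an LLM for an emotion label. It is a minimal illustration only; the function names, label set, and ASR system tags are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch: build an SER prompt from prior conversation turns and
# transcripts of the target utterance produced by several ASR systems.
# All names here (build_prompt, the ASR tags, the label set) are hypothetical.

from typing import Dict, List

EMOTIONS = ["angry", "happy", "sad", "neutral"]  # IEMOCAP-style label set


def build_prompt(context: List[Dict[str, str]],
                 target_transcripts: Dict[str, str],
                 context_window: int = 5) -> str:
    """Assemble an SER prompt from recent turns and multi-ASR transcripts."""
    # Keep only the most recent `context_window` turns of the conversation.
    recent = context[-context_window:]
    turn_lines = [f"{turn['speaker']}: {turn['text']}" for turn in recent]

    # Include transcripts from multiple ASR systems so the LLM can
    # reconcile their differing word errors.
    asr_lines = [f"[{name}] {text}" for name, text in target_transcripts.items()]

    return (
        "Conversation so far:\n" + "\n".join(turn_lines) + "\n\n"
        "Target utterance transcripts from different ASR systems:\n"
        + "\n".join(asr_lines) + "\n\n"
        f"Classify the speaker's emotion as one of: {', '.join(EMOTIONS)}."
    )


if __name__ == "__main__":
    prompt = build_prompt(
        context=[{"speaker": "A", "text": "You never called me back."},
                 {"speaker": "B", "text": "I was stuck at work, I'm sorry."}],
        target_transcripts={"whisper": "i said i am sorry okay",
                            "wav2vec2": "i said i am sorry ok"},
    )
    print(prompt)  # Pass this prompt to the LLM of choice to obtain the label.
```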
Submission Number: 49