EAA: Emotion-Aware Audio Large Language Models with Dual Cross-Attention and Context-Aware Instruction Tuning
Abstract: Understanding speech emotion through artificial intelligence (AI) is crucial for human-computer interaction and mental health monitoring. While audio large language models (ALLMs) excel at speech comprehension, they struggle to accurately integrate emotional signals from acoustic and semantic features. Moreover, emotions often span multiple turns of a dialogue, so relying solely on the current utterance's audio is insufficient for comprehensive understanding. To address these challenges, we propose a novel emotion-aware audio large language model (EAA). Specifically, we design a dual cross-attention mechanism that fuses acoustic and semantic information into a more comprehensive emotional representation. Furthermore, we apply context-aware instruction tuning, incorporating both the current and immediately preceding utterances as contextual information to enhance task understanding and emotion recognition. Our experimental results show that EAA outperforms existing ALLMs on the MELD dataset, improving accuracy by 11.4%.
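The dual cross-attention idea can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes standard scaled dot-product cross-attention applied in both directions (acoustic queries attending to semantic features and vice versa), with a simple mean-pool-and-concatenate fusion; all function names and the fusion step are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention: each query row attends over
    # the rows of keys_values (here, keys == values for simplicity).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def dual_cross_attention(acoustic, semantic):
    # Direction 1: acoustic frames query the semantic features.
    a2s = cross_attention(acoustic, semantic)   # (T_a, d)
    # Direction 2: semantic tokens query the acoustic features.
    s2a = cross_attention(semantic, acoustic)   # (T_s, d)
    # Hypothetical fusion: residual add, mean-pool each stream,
    # then concatenate into one emotional representation.
    pooled_a = (acoustic + a2s).mean(axis=0)    # (d,)
    pooled_s = (semantic + s2a).mean(axis=0)    # (d,)
    return np.concatenate([pooled_a, pooled_s]) # (2d,)

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((5, 8))  # 5 acoustic frames, dim 8
semantic = rng.standard_normal((7, 8))  # 7 semantic tokens, dim 8
fused = dual_cross_attention(acoustic, semantic)
print(fused.shape)  # (16,)
```

A full model would use learned query/key/value projections and multiple heads; this sketch only shows how the two attention directions yield one fused representation.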
External IDs: dblp:conf/interspeech/DuLZG25