A-SMiLE: Affective Sparse Mixture-of-Experts Adapter with Multi-Task Learning for Spoken Dialogue Models
Abstract: Large language models have made remarkable progress in generating coherent and contextually relevant dialogue. However, they still struggle to capture fine-grained paralinguistic cues, limiting emotional coherence and contextual appropriateness. To address this challenge, we propose an Affective Sparse Mixture-of-experts adapter with multi-task Learning (A-SMiLE) that enhances a spoken dialogue model's capability for affective perception across multiple dimensions. Our approach integrates Valence, Arousal, and Dominance (VAD) modeling with response generation, thereby ensuring expressive and contextually aware interactions. Evaluations on DailyTalk and a hard-case benchmark demonstrate significant improvements over baselines in both emotion prediction and response generation. These results highlight the potential of cognition-inspired affective modeling to enhance intelligent speech interaction in spoken dialogue models.
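To make the abstract's architecture concrete, below is a minimal sketch of the general pattern it names: a sparse mixture-of-experts adapter applied to a dialogue model's hidden states, trained with a multi-task objective that combines VAD regression and next-token response generation. All module names, dimensions, the top-k routing scheme, and the loss weighting are illustrative assumptions, not the paper's actual specification.

```python
# Sketch of a sparse MoE adapter with multi-task (VAD + generation) heads.
# Hyperparameters and structure are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEAdapter(nn.Module):
    """Routes each token's hidden state to its top-k experts (here k=2)."""

    def __init__(self, d_model: int = 768, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, T, D)
        logits = self.router(h)                        # (B, T, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)     # top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(h)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(h[mask])
        return h + out                                 # residual connection


class MultiTaskHeads(nn.Module):
    """Joint heads: VAD regression on a pooled state, plus an LM head."""

    def __init__(self, d_model: int = 768, vocab_size: int = 32000):
        super().__init__()
        self.vad_head = nn.Linear(d_model, 3)  # valence, arousal, dominance
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, h: torch.Tensor):
        return self.vad_head(h.mean(dim=1)), self.lm_head(h)


def multitask_loss(vad_pred, vad_gold, logits, tokens, lam: float = 0.5):
    """Hypothetical combined objective: token cross-entropy + weighted VAD MSE."""
    lm = F.cross_entropy(logits.view(-1, logits.size(-1)), tokens.view(-1))
    vad = F.mse_loss(vad_pred, vad_gold)
    return lm + lam * vad
```

In this kind of design the adapter and heads are typically the only trainable parameters, so the frozen dialogue model keeps its generation quality while the sparse routing lets different experts specialize in different affective patterns; whether A-SMiLE follows exactly this recipe is not stated in the abstract.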
External IDs: dblp:conf/interspeech/ChaoPNMNCC25