A Multimodal, Multi-Turn Large Speech-Language Model for Real-Time Emotion Tracking and Empathetic Responding

Published: 01 Aug 2025 · Last Modified: 26 Aug 2025 · SpeechAI TTIC 2025 Oral or Poster · License: CC BY 4.0
Keywords: Multimodal Large Language Model, Speech Emotion Recognition, Empathetic Responding, Multi-Turn Conversation, Real-Time AI, Speech and Language AI
TL;DR: We build a large speech-language model that perceives emotion in a user's voice and responds empathetically in real time, enabling more natural conversations.
Presentation Preference: Open to it if recommended by organizers
Abstract: Making human-computer interaction as natural as human-to-human conversation requires models that can perceive and appropriately react to nuanced emotional cues. A primary challenge lies in recognizing emotion from a user's speech and generating empathetic responses in real time to maintain conversational flow. To address this, we propose a novel multimodal large speech-language model capable of real-time emotion tracking and empathetic responding within multi-turn dialogues. Our method uses a three-stage training framework that progressively builds the model's capabilities: the first stage applies supervised finetuning on text- and emotion-based objectives; the second, an unsupervised finetuning stage, aligns speech and text embeddings; and the final stage employs supervised multi-turn finetuning so the model can effectively process conversational history and context. The architecture integrates a speech encoder with a large language model and is trained on benchmark multimodal datasets including MELD, CMU-MOSEI, and IEMOCAP. This work contributes an end-to-end solution for developing more socially aware and emotionally intelligent conversational agents.
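The abstract does not specify concrete components, so the following is a minimal PyTorch sketch of the kind of architecture it describes: a speech encoder whose outputs are projected into an LLM's embedding space (the Stage 2 alignment target), with an emotion head mirroring the Stage 1 emotion objective. The GRU stand-in for the speech encoder, the small Transformer stand-in for the LLM, and all dimensions and class names are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeechLLMSketch(nn.Module):
    """Illustrative speech-encoder + LLM stack; components are stand-ins."""

    def __init__(self, speech_dim=512, llm_dim=768, num_emotions=7):
        super().__init__()
        # Stand-in for a pretrained speech encoder over log-mel features
        self.speech_encoder = nn.GRU(input_size=80, hidden_size=speech_dim,
                                     batch_first=True)
        # Learned projection aligning speech states to the LLM embedding space
        # (the role played by the unsupervised Stage 2 alignment)
        self.projector = nn.Linear(speech_dim, llm_dim)
        # Stand-in for the large language model backbone
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        # Emotion classifier corresponding to the Stage 1 emotion objective
        self.emotion_head = nn.Linear(llm_dim, num_emotions)

    def forward(self, mel_features):
        # mel_features: (batch, frames, 80) log-mel spectrogram
        speech_states, _ = self.speech_encoder(mel_features)
        llm_inputs = self.projector(speech_states)  # map into LLM space
        hidden = self.llm(llm_inputs)
        # Pool over time for an utterance-level emotion prediction
        return self.emotion_head(hidden.mean(dim=1))

model = SpeechLLMSketch()
logits = model(torch.randn(2, 100, 80))  # two utterances, 100 frames each
print(logits.shape)  # torch.Size([2, 7])
```

In the full system described by the abstract, the projected speech embeddings would presumably be interleaved with text tokens from the conversational history for the multi-turn Stage 3 finetuning; this sketch only shows the single-utterance encoding and emotion-recognition path.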
Submission Number: 9