Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech

ACL ARR 2024 June Submission 206 Authors

07 Jun 2024 (modified: 22 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Human conversation is usually conducted through language, speech, and visual information. Each communication medium carries rich information and complements the others; for example, speech (para-linguistic cues) may convey a mood that is not well represented in language alone. Multimodal LLMs consider multimodal information and aim to generate text responses. However, generating more natural and engaging speech responses has received little attention, even though a text-only response cannot provide a rich conversational experience. In this paper, we propose a more human-like agent that produces a speech response based on the conversation mood and responsive style information. Our model is trained to generate text responses along with voice descriptions from a multimodal conversational environment. Using the voice description, the model generates speech that conveys para-linguistic information. To achieve this goal, we first build a novel multi-sensory conversation dataset focused primarily on speech, enabling conversational agents to produce natural spoken communication. We then propose a multimodal LLM-based model that generates both the text response and the voice description. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation and of generating lively speech.
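The abstract describes a two-stage response pipeline: a multimodal LLM consumes the conversation history (text, vision, audio) and emits a text reply plus a free-form voice description, which then conditions a speech synthesizer. The sketch below is only an illustration of that flow under stated assumptions; the class and method names (MultimodalChatLM, DescriptionConditionedTTS, etc.) are hypothetical and do not refer to the authors' released code.

```python
# Illustrative sketch only, not the paper's implementation.
# A hypothetical multimodal LLM produces (reply, voice description),
# and a description-conditioned TTS model turns that pair into speech.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One turn of multimodal conversation history."""
    speaker: str
    text: str
    image_path: Optional[str] = None   # optional visual context
    audio_path: Optional[str] = None   # optional speech of this turn


class MultimodalChatLM:
    """Hypothetical stand-in for the paper's multimodal LLM."""

    def generate(self, history: List[Turn]) -> Tuple[str, str]:
        # In the paper's setting, the model reads the multimodal history and
        # emits both a text reply and a para-linguistic voice description
        # (e.g. mood, pace, tone). Placeholder outputs are returned here.
        reply = "That sounds wonderful, tell me more!"
        voice_description = "warm, upbeat voice, moderately fast, cheerful tone"
        return reply, voice_description


class DescriptionConditionedTTS:
    """Hypothetical TTS model conditioned on a free-form voice description."""

    def synthesize(self, text: str, voice_description: str) -> bytes:
        # A real system would call a description-guided TTS model here;
        # this stub just returns empty audio bytes.
        return b""


def respond(history: List[Turn]) -> bytes:
    """End-to-end sketch: history -> (reply, description) -> speech."""
    llm = MultimodalChatLM()
    tts = DescriptionConditionedTTS()
    reply, description = llm.generate(history)
    return tts.synthesize(reply, description)
```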
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Dialogue and Interactive Systems, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 206