Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission · Readers: Everyone
Abstract: In this paper, we introduce a novel Face-to-Face spoken dialogue model. It takes audio-visual speech from the user as input and generates audio-visual speech in response, marking a first step toward an avatar chatbot system that does not rely on intermediate text. To this end, we introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus, containing approximately 10,000 dialogues (387 hours) recorded from the open-domain dialogue dataset Topical-Chat. MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given scripts, together with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model builds on a textually pretrained large language model and adapts it to the audio-visual spoken dialogue domain through joint speech-text pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating face-to-face conversation. All the data will be open-sourced.
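To make the pipeline described in the abstract concrete, the following is a minimal sketch of the face-to-face dialogue loop: discretize the user's audio-visual speech, generate response tokens with the adapted LLM, and synthesize the audio-visual reply without an intermediate text transcript. This is an illustration based only on the abstract, not the authors' code; all component names (AVSpeechTokenizer, SpeechUnitLM, AVSynthesizer) and their interfaces are hypothetical.

class FaceToFaceDialogueModel:
    """Hypothetical wrapper; the component interfaces are assumptions, not the paper's API."""

    def __init__(self, av_tokenizer, unit_lm, av_synthesizer):
        self.av_tokenizer = av_tokenizer      # (audio, video) -> discrete speech-unit tokens
        self.unit_lm = unit_lm                # text-pretrained LLM adapted via joint speech-text pretraining
        self.av_synthesizer = av_synthesizer  # unit tokens -> (speech waveform, talking-face video)

    def respond(self, user_audio, user_video):
        # 1) Discretize the user's audio-visual speech into unit tokens.
        input_units = self.av_tokenizer.encode(user_audio, user_video)
        # 2) Autoregressively generate response unit tokens with the adapted LLM,
        #    skipping any intermediate text representation.
        response_units = self.unit_lm.generate(input_units)
        # 3) Render the units back into synchronized speech audio and face video.
        return self.av_synthesizer.decode(response_units)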
Paper Type: long
Research Area: Dialogue and Interactive Systems
Contribution Types: Data resources
Languages Studied: English
Preprint Status: We are considering releasing a non-anonymous preprint in the next two months (i.e., during the reviewing process).
A1: yes
A2: yes
A3: yes
B: yes
C: yes
C2: yes
D: yes
E: yes