Keywords: Multimodal dialogue dataset, Multimodal conditional dialogue generation, Spoken dialogue generation
TL;DR: We propose an expressive multimodal dialogue dataset with dialogue-level style annotations using an automated pipeline, then introduce explicit and implicit control in multimodal dialogue generation.
Abstract: The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multi-modal dialogue.
While current methods generate impressively realistic dialogue in the speech and vision modalities, challenges remain in multi-modal conditional dialogue generation.
This paper focuses on the natural alignment between speech, vision, and text, aiming at expressive dialogue generation through multi-modal conditional control.
Since existing datasets lack richness and diversity in dialogue expressiveness, we introduce a novel multi-modal dialogue annotation pipeline that extracts meaningful dialogues from movies and TV series and annotates them at a fine-grained level across modalities.
The resulting dataset, MM-Dia, comprises over 360 hours and 54,700 dialogues, supporting explicit control in multi-modal dialogue generation through style-controllable dialogue speech synthesis.
The proposed benchmark, MM-Dia-Bench, contains 309 highly expressive dialogues with visible dual- or single-speaker scenes. It supports the evaluation of implicit cross-modal control, using downstream multi-modal dialogue generation tasks to assess audio-visual style consistency across modalities.
Our experiments demonstrate the effectiveness of our data in enhancing style controllability and reveal limitations in the ability of current frameworks to replicate the expressiveness of human interaction, providing new insights and challenges for multi-modal conditional dialogue generation. Code, demo, and data will be released at: https://mmdiaiclr26.github.io/mmdiaiclr26/.
Primary Area: datasets and benchmarks
Submission Number: 24632