Abstract: Current co-speech motion generation approaches usually focus on upper-body gestures that follow speech content, and lack support for elaborate control of synergistic full-body motion via text prompts, such as {\it talking while walking}. The major challenges are that 1) existing speech-to-motion datasets contain only highly limited full-body motions, leaving a wide range of common human activities out of the training distribution; and 2) these datasets lack annotated user prompts. To address these challenges, we propose SynTalker, which utilizes an off-the-shelf text-to-motion dataset as an auxiliary source to supply the missing full-body motions and prompts. The core technical contributions are two-fold. The first is a multi-stage training process that obtains an aligned embedding space of motion, speech, and prompts despite the significant distributional mismatch in motion between speech-to-motion and text-to-motion datasets. The second is a diffusion-based conditional inference process that uses a separate-then-combine strategy to realize fine-grained control of local body parts. Extensive experiments verify that our approach supports precise and flexible control of synergistic full-body motion generation based on both speech and user prompts, which is beyond the ability of existing approaches. The code is released at (link will be published upon acceptance).
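To make the separate-then-combine idea concrete, below is a minimal illustrative sketch (not the authors' implementation) of one reverse-diffusion step in which two condition-specific noise estimates are predicted separately and then merged with a per-body-part mask before a standard DDPM update. The denoiser interfaces, mask layout, and schedule tensors are assumptions for illustration only.

```python
import torch

def separate_then_combine_step(x_t, t, speech_denoiser, prompt_denoiser,
                               speech_feat, prompt_feat, part_mask,
                               alpha, alpha_bar, beta):
    """One reverse-diffusion step over motion x_t of shape (B, J, D).

    part_mask: (1, J, 1) tensor with 1 for joints driven by the speech
    branch (e.g., upper body/hands) and 0 for joints driven by the
    prompt branch (e.g., lower body/locomotion). All names here are
    hypothetical placeholders, not the paper's API.
    """
    # Predict noise separately under each condition.
    eps_speech = speech_denoiser(x_t, t, speech_feat)   # (B, J, D)
    eps_prompt = prompt_denoiser(x_t, t, prompt_feat)   # (B, J, D)

    # Combine per body part: speech controls masked joints, prompt the rest.
    eps = part_mask * eps_speech + (1.0 - part_mask) * eps_prompt

    # Standard DDPM posterior mean; add noise except at the final step.
    coef = beta[t] / torch.sqrt(1.0 - alpha_bar[t])
    mean = (x_t - coef * eps) / torch.sqrt(alpha[t])
    if t == 0:
        return mean
    return mean + torch.sqrt(beta[t]) * torch.randn_like(x_t)
```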
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We study prompt-based co-speech motion generation, which takes audio and text as inputs and produces human motion as output. Furthermore, co-speech motion generation is among the central tasks in creating digital talking avatars, whose applications are widespread in the multimedia industry. For these reasons, we believe that our work makes significant contributions to multimedia/multimodal processing.
Supplementary Material: zip
Submission Number: 604