Keywords: video generation, talking head, speech, implicit motion, efficient
Abstract: This paper introduces an implicit face motion diffusion model (IF-MDM), a fully self-supervised framework for learning dynamic facial motion tailored to audio-driven talking head generation. IF-MDM eliminates the need for explicit human head priors by using implicit motion templates, addressing both the common visual alignment issues between the head and the background and the computational cost of conventional, heavy latent diffusion-based methods. To improve speech-motion alignment, our approach incorporates (1) local flow modules for fine-grained motion modeling, (2) motion statistics guidance to control head pose and facial expression intensity, and (3) frame-wise temporal guidance to accurately capture phoneme-level dependencies in lip movements. IF-MDM achieves real-time performance, generating realistic, high-fidelity 512×512 videos at up to 45 fps. By capturing subtle dynamic motions such as eye blinking and torso movement purely through self-supervised learning, the model extends beyond human faces, offering generalizable talking head generation for a variety of characters and animals. For further details, including supplementary materials and code, please visit our project page (https://ifmdm.github.io).
Primary Area: generative models
Submission Number: 5635
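The abstract above describes a pipeline in which audio features drive a diffusion model over per-frame implicit motion latents, steered by motion statistics (head pose, expression intensity) and frame-wise temporal conditioning, before a decoder renders the frames. As a rough illustration only, the following minimal PyTorch sketch assumes a generic per-frame denoiser interface; the function names, tensor shapes, the simplistic sampler update, and the `toy_denoiser` stand-in are all hypothetical and are not the authors' released code.

```python
import torch

def sample_motion(denoiser, audio_feats, motion_mean, motion_std,
                  steps=50, latent_dim=64):
    """Denoise per-frame implicit motion latents conditioned on audio and motion statistics.

    Hypothetical interface: `denoiser` takes noisy latents, a timestep, per-frame
    audio features, and global motion statistics, and returns a noise estimate.
    """
    num_frames = audio_feats.shape[0]
    z = torch.randn(num_frames, latent_dim)      # start from Gaussian noise
    for step in reversed(range(steps)):
        t = torch.full((num_frames,), step)
        # Frame-wise conditioning: each latent sees its own audio feature
        # (phoneme-level), plus global statistics controlling pose/intensity.
        eps = denoiser(z, t, audio_feats, motion_mean, motion_std)
        z = z - eps / steps                      # crude update standing in for a real sampler
    return z

if __name__ == "__main__":
    # Stand-in denoiser for demonstration; a real model would be a trained network.
    def toy_denoiser(z, t, audio, mean, std):
        return 0.1 * z

    audio = torch.randn(16, 128)                 # 16 frames of audio features
    motion = sample_motion(toy_denoiser, audio, motion_mean=0.0, motion_std=1.0)
    print(motion.shape)                          # torch.Size([16, 64])
```

In an actual system, the sampled motion latents would then be passed to a frame decoder (together with an identity/appearance reference) to render the final video; that rendering stage is omitted here.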