IF-MDM: Implicit Face Motion Diffusion Model for Compressing Dynamic Motion Latent

ICLR 2026 Conference Submission 5635 Authors

15 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: video generation, talking head, speech, implicit motion, efficient
Abstract: This paper introduces an implicit face motion diffusion model (IF-MDM), a fully self-supervised framework for learning dynamic facial motion tailored for audio-driven talking head generation. IF-MDM eliminates the need for explicit human head priors by utilizing implicit motion templates, effectively addressing common visual alignment issues between the head and the background, as well as the computational challenges associated with conventional, heavy latent diffusion-based methods. To enhance speech-motion alignment, our approach incorporates (1) local flow modules for fine-grained motion modeling, (2) motion statistics guidance to manage head pose and facial expression intensity, and (3) framewise temporal guidance to accurately capture phoneme-level dependencies in lip movements. IF-MDM achieves real-time performance, generating realistic and high-fidelity 512x512 resolution videos at up to 45 fps. By capturing subtle dynamic motions such as eye blinking and torso movements purely through self-supervised learning, our model extends its applicability beyond human faces, offering generalizable talking head generation for various characters and animals. For more details on this work, including supplementary materials and code, please visit our project page (https://ifmdm.github.io).
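To make the components named in the abstract (per-frame implicit motion latents, audio conditioning, and motion statistics guidance over head pose and expression intensity) more concrete, below is a minimal, hypothetical PyTorch sketch of a noise-prediction diffusion step over a window of motion latents. Every module name, tensor dimension, and the toy noise schedule here is an assumption made for illustration; none of it is taken from the paper or its code.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- not specified in the abstract.
MOTION_DIM = 128   # per-frame implicit motion latent
AUDIO_DIM = 256    # per-frame speech feature


class MotionDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to a window of implicit motion
    latents, conditioned on framewise audio features and on motion statistics
    (mean/std of the motion sequence) used as a guidance signal."""

    def __init__(self, motion_dim=MOTION_DIM, audio_dim=AUDIO_DIM, hidden=512):
        super().__init__()
        in_dim = motion_dim + audio_dim + 2 * motion_dim + 1  # latent + audio + stats + timestep
        self.in_proj = nn.Linear(in_dim, hidden)
        self.backbone = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feat, motion_stats, t):
        # noisy_motion: (B, T, MOTION_DIM)   noisy implicit motion latents
        # audio_feat:   (B, T, AUDIO_DIM)    framewise speech features
        # motion_stats: (B, 2 * MOTION_DIM)  mean/std guidance signal
        # t:            (B,)                 diffusion timestep
        B, T, _ = noisy_motion.shape
        stats = motion_stats.unsqueeze(1).expand(B, T, -1)
        tt = t.float().view(B, 1, 1).expand(B, T, 1) / 1000.0
        x = torch.cat([noisy_motion, audio_feat, stats, tt], dim=-1)
        h, _ = self.backbone(self.in_proj(x))
        return self.out_proj(h)  # predicted noise, (B, T, MOTION_DIM)


# Usage: one DDPM-style training step (noise prediction) on random data.
if __name__ == "__main__":
    model = MotionDenoiser()
    B, T = 2, 25
    motion = torch.randn(B, T, MOTION_DIM)   # clean implicit motion latents
    audio = torch.randn(B, T, AUDIO_DIM)     # aligned audio features
    stats = torch.cat([motion.mean(1), motion.std(1)], dim=-1)  # guidance statistics
    t = torch.randint(0, 1000, (B,))
    noise = torch.randn_like(motion)
    alpha = torch.rand(B, 1, 1)              # stand-in for a real noise schedule
    noisy = alpha.sqrt() * motion + (1 - alpha).sqrt() * noise
    loss = nn.functional.mse_loss(model(noisy, audio, stats, t), noise)
    loss.backward()
    print(f"toy denoising loss: {loss.item():.4f}")
```

The sketch only shows the conditioning pattern: motion statistics are broadcast across the frame axis and concatenated with the audio features, so the denoiser can be steered at inference time by supplying different statistics (e.g. lower variance for subdued head motion). The actual IF-MDM architecture, local flow modules, and framewise temporal guidance are not represented here.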
Primary Area: generative models
Submission Number: 5635