Abstract: Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these challenges, we introduce the \textbf{M}otion-priors \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (\textbf{MCDM}), which leverages both archived-clip and present-clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key components: (1) an archived-clip motion prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate prediction of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also introduce the {TalkingFace-Wild} dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation.
Lay Summary: TalkingFace videos—where a static photo of a person is animated to speak and move naturally—are becoming increasingly popular in applications such as virtual assistants, films, and education. However, it is still challenging to generate long videos where the head movement, facial expressions, and lip sync remain natural and consistent over time. In this paper, we present a new method called Motion-priors Conditional Diffusion Model (MCDM), which uses both past and current video information to better predict how a person should move and speak in each frame. Our model also introduces an efficient way to remember and update motion patterns as the video progresses, helping to avoid common errors like unnatural movements or drifting faces. To train and test this approach, we built a large multilingual video dataset with over 200 hours of footage in 10 languages. Our experiments show that MCDM produces more realistic and consistent TalkingFace videos, opening new possibilities for high-quality, long-form animations.
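To make the third component concrete, below is a minimal PyTorch sketch of what a memory-efficient temporal attention layer of the kind described above could look like: a fixed-size bank of motion features is queried by the present clip and then updated in place, so the memory footprint stays constant no matter how long the generated video is. All class names, tensor sizes, and the gated update rule are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Illustrative sketch only: a fixed-size motion-feature memory attended to by the
# present clip, with a gated in-place update. Names and sizes are assumptions.
import torch
import torch.nn as nn


class MemoryTemporalAttention(nn.Module):
    def __init__(self, dim: int = 256, memory_slots: int = 32, heads: int = 4):
        super().__init__()
        # Learnable initial memory of motion features (constant size -> bounded cost).
        self.memory = nn.Parameter(torch.randn(memory_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.update_gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, memory: torch.Tensor = None):
        """
        x:      (B, T, dim) motion features of the present clip.
        memory: (B, S, dim) memory carried over from earlier clips (optional).
        Returns attended features and the updated memory for the next clip.
        """
        B = x.size(0)
        if memory is None:
            memory = self.memory.unsqueeze(0).expand(B, -1, -1)

        # Present-clip features attend over the archived motion memory.
        out, _ = self.attn(query=x, key=memory, value=memory)

        # Gated update of each memory slot with a summary of the current clip,
        # a simple stand-in for "dynamically storing and updating motion features".
        clip_summary = x.mean(dim=1, keepdim=True).expand_as(memory)
        gate = torch.sigmoid(self.update_gate(torch.cat([memory, clip_summary], dim=-1)))
        new_memory = gate * memory + (1.0 - gate) * clip_summary

        return out + x, new_memory


if __name__ == "__main__":
    layer = MemoryTemporalAttention()
    feats = torch.randn(2, 16, 256)   # batch of 2 clips, 16 frames each
    mem = None
    for _ in range(3):                # process consecutive clips
        feats, mem = layer(feats, mem)
    print(feats.shape, mem.shape)     # (2, 16, 256) and (2, 32, 256)
```

Because the memory has a fixed number of slots, the cost of each attention step does not grow with video length, which is one plausible way to keep long-term generation tractable while still conditioning on archived motion.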
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Diffusion Model, TalkingFace, Pose
Submission Number: 14128