WarpFace: Revisiting Face Reenactment via Self-Supervised Motion Learning in Diffusion Models

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Face Animation, Motion Extraction, Diffusion Application
TL;DR: We propose a self-supervised diffusion framework for face reenactment that requires no human-specific priors, achieving expressive and controllable motion transfer.
Abstract: Face reenactment enables personalized motion transfer between identities and serves as a fundamental task in human animation. While recent diffusion-based approaches have achieved impressive visual quality, they often depend on human-centric inductive biases (e.g., landmark detectors), a dependence that limits their flexibility and scalability. In contrast, self-supervised GANs have demonstrated that meaningful motion representations can emerge directly from raw videos. This motivates us to introduce a novel self-supervised diffusion framework for face reenactment that eliminates the need for domain-specific priors. Our key insight is that diffusion models inherently encode rich motion cues, but naive extraction often leads to semantic collapse, where motion representations lose discriminability. To address this, we propose \textbf{WarpFace} with two core components: (1) Warping-enhanced Cross-Attention (\textit{WarpCA}), which incorporates geometry-aware warping within the attention mechanism to enable robust motion learning while preventing semantic collapse; and (2) a Multi-Group Motion Encoder (\textit{MGME}), which disentangles motion into structured subspaces for fine-grained control. Extensive experiments demonstrate that our method achieves expressive and accurate reenactment without relying on manual annotations or human-specific pretrained priors.
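As a rough illustration of the WarpCA idea described above, the sketch below shows cross-attention in which driving-frame motion features are geometrically warped by a predicted flow before serving as keys and values. This is a hypothetical PyTorch sketch based only on the abstract; the module names, shapes, and the flow-prediction head are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of warping-enhanced cross-attention (WarpCA), inferred from
# the abstract. All names, shapes, and the flow head are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpCASketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Assumed head that predicts a dense 2-channel flow field from motion features.
        self.flow_head = nn.Conv2d(dim, 2, kernel_size=3, padding=1)

    def forward(self, appearance_tokens: torch.Tensor, motion_feat: torch.Tensor) -> torch.Tensor:
        # appearance_tokens: (B, N, C) tokens from the diffusion backbone (queries).
        # motion_feat:       (B, C, H, W) features from the driving frame (keys/values).
        B, C, H, W = motion_feat.shape

        # Predict a coarse flow and warp the motion features (geometry-aware warping).
        flow = self.flow_head(motion_feat)                      # (B, 2, H, W)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=motion_feat.device),
            torch.linspace(-1, 1, W, device=motion_feat.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
        grid = base_grid + flow.permute(0, 2, 3, 1)             # displaced sampling grid
        warped = F.grid_sample(motion_feat, grid, align_corners=True)

        # Warped motion features act as keys/values in cross-attention.
        kv = warped.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        out, _ = self.attn(appearance_tokens, kv, kv)
        return out
```

The design point this sketch is meant to convey, under the stated assumptions, is that warping aligns the motion features with the target geometry before attention, so the attention map does not have to absorb large spatial misalignments, which is one plausible way to avoid the semantic collapse the abstract describes.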
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6867