Submission Type: Short Papers (up to 4 pages)
Keywords: Facial part transfer, Video editing, Diffusion autoencoders
TL;DR: A training-free method for motion-consistent facial part transfer in videos that combines DiffAE semantic feature blending, landmark-guided alignment, stochastic detail recovery, and region-guided flow-based temporal attention.
Abstract: Facial part transfer in videos, such as swapping eyes, eyebrows, nose, or mouth between identities, remains largely underexplored compared to full-face video swapping. While several works study facial part transfer in images, extending this task to videos introduces additional challenges: the transferred part must move naturally with the target’s motion and expression while remaining visually consistent with the surrounding face. Directly injecting reference appearance often leads to pasted-looking artifacts or temporal inconsistencies across frames. Moreover, no dedicated benchmark currently exists for evaluating facial part transfer in videos.
We propose V-PartSwap, a training-free, reference-guided framework for video facial part transfer built upon Diffusion Autoencoders (DiffAE). Our key insight is that DiffAE’s reconstruction-based latent space enables localized semantic blending while preserving non-edited regions. We inject reference appearance via region-restricted semantic feature blending in the DiffAE encoder, align reference parts to the target’s facial motion using landmark-driven thin-plate spline (TPS) warping, and enforce temporal coherence with region-guided flow-based attention. To facilitate systematic evaluation, we construct a new benchmark for facial part transfer in videos.
Extensive experiments show that our method produces visually coherent edits with improved motion alignment and competitive video quality compared to state-of-the-art editing methods.
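To make two of the abstract's components more concrete, below is a minimal, illustrative sketch of (i) landmark-driven thin-plate spline alignment of a reference part crop to the target's landmarks, and (ii) region-restricted feature blending that edits only the masked part while leaving the rest untouched. This is not the paper's implementation: the function names (`tps_align_part`, `blend_part_features`), the channel-last layout, and the use of SciPy's `RBFInterpolator` with a thin-plate-spline kernel are assumptions for illustration only.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates


def tps_align_part(ref_img, ref_pts, tgt_pts):
    """Warp a reference facial-part crop so its landmarks match the target frame.

    ref_img : (H, W, C) float array, reference appearance crop.
    ref_pts : (N, 2) landmark coordinates (x, y) in the reference crop.
    tgt_pts : (N, 2) corresponding landmark coordinates in the target crop.
    Returns a warped reference crop with the same shape as ref_img.
    """
    H, W = ref_img.shape[:2]
    # Fit a backward mapping with a thin-plate-spline kernel:
    # for every target pixel we ask where to sample in the reference crop.
    tps = RBFInterpolator(tgt_pts, ref_pts, kernel="thin_plate_spline")
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    src = tps(grid)  # (H*W, 2) sampling locations (x, y) in the reference crop
    warped = np.stack(
        [
            map_coordinates(
                ref_img[..., c], [src[:, 1], src[:, 0]], order=1, mode="nearest"
            ).reshape(H, W)
            for c in range(ref_img.shape[2])
        ],
        axis=-1,
    )
    return warped


def blend_part_features(tgt_feat, ref_feat, part_mask, alpha=1.0):
    """Region-restricted blend of (already aligned) reference features.

    tgt_feat, ref_feat : (H, W, C) feature maps on the same spatial grid.
    part_mask          : (H, W) soft mask in [0, 1] selecting the edited part.
    Only the masked region is replaced; non-edited regions are preserved.
    """
    m = alpha * part_mask[..., None]  # broadcast the mask over the channel axis
    return m * ref_feat + (1.0 - m) * tgt_feat
```

In this sketch the warp is resolved per frame from the target's landmarks, so the transferred part follows the target's motion, and the masked blend keeps the edit localized; how these steps interact with the DiffAE encoder and the flow-guided temporal attention is specific to the paper and not reproduced here.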
Submission Number: 11