Submission Type: Short Papers (up to 4 pages)
Keywords: Facial part transfer, Video editing, Diffusion autoencoders
TL;DR: A training-free method for motion-consistent facial part transfer in videos that combines DiffAE semantic feature blending, landmark-guided alignment, stochastic detail recovery, and region-guided flow-based temporal attention.
Abstract: Facial part transfer in videos, such as swapping eyes, eyebrows, nose, or mouth between identities, remains largely underexplored compared to full-face video swapping. While several works study facial part transfer in images, extending this task to videos introduces additional challenges: the transferred part must move naturally with the target’s motion and expression while remaining visually consistent with the surrounding face. Directly injecting reference appearance often leads to pasted-looking artifacts or temporal inconsistencies across frames. Moreover, no dedicated benchmark currently exists for evaluating facial part transfer in videos.
We propose V-PartSwap, a training-free, reference-guided framework for video facial part transfer built upon Diffusion Autoencoders (DiffAE). Our key insight is that DiffAE’s reconstruction-based latent space enables localized semantic blending while preserving non-edited regions. We inject reference appearance via region-restricted semantic feature blending in the DiffAE encoder, align reference parts to the target’s facial motion using landmark-driven thin-plate spline (TPS) warping, and enforce temporal coherence with region-guided flow-based attention. To facilitate systematic evaluation, we construct a new benchmark for facial part transfer in videos.
Extensive experiments show that our method produces visually coherent edits with improved motion alignment and competitive video quality compared to state-of-the-art editing methods.
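To make two of the abstract's components more concrete, below is a minimal, illustrative sketch of (i) landmark-driven thin-plate spline alignment of a reference part crop to the target's landmarks, and (ii) region-restricted feature blending that edits only the masked part while leaving the rest untouched. This is not the paper's implementation: the function names (`tps_align_part`, `blend_part_features`), the channel-last layout, and the use of SciPy's `RBFInterpolator` with a thin-plate-spline kernel are assumptions for illustration only.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates


def tps_align_part(ref_img, ref_pts, tgt_pts):
    """Warp a reference facial-part crop so its landmarks match the target frame.

    ref_img : (H, W, C) float array, reference appearance crop.
    ref_pts : (N, 2) landmark coordinates (x, y) in the reference crop.
    tgt_pts : (N, 2) corresponding landmark coordinates in the target crop.
    Returns a warped reference crop with the same shape as ref_img.
    """
    H, W = ref_img.shape[:2]
    # Fit a backward mapping with a thin-plate-spline kernel:
    # for every target pixel we ask where to sample in the reference crop.
    tps = RBFInterpolator(tgt_pts, ref_pts, kernel="thin_plate_spline")
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    src = tps(grid)  # (H*W, 2) sampling locations (x, y) in the reference crop
    warped = np.stack(
        [
            map_coordinates(
                ref_img[..., c], [src[:, 1], src[:, 0]], order=1, mode="nearest"
            ).reshape(H, W)
            for c in range(ref_img.shape[2])
        ],
        axis=-1,
    )
    return warped


def blend_part_features(tgt_feat, ref_feat, part_mask, alpha=1.0):
    """Region-restricted blend of (already aligned) reference features.

    tgt_feat, ref_feat : (H, W, C) feature maps on the same spatial grid.
    part_mask          : (H, W) soft mask in [0, 1] selecting the edited part.
    Only the masked region is replaced; non-edited regions are preserved.
    """
    m = alpha * part_mask[..., None]  # broadcast the mask over the channel axis
    return m * ref_feat + (1.0 - m) * tgt_feat
```

In this sketch the warp is resolved per frame from the target's landmarks, so the transferred part follows the target's motion, and the masked blend keeps the edit localized; how these steps interact with the DiffAE encoder and the flow-guided temporal attention is specific to the paper and not reproduced here.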
Submission Number: 11