VividFace: A Robust and High-Fidelity Video Face Swapping Framework

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY-SA 4.0
Keywords: video generation, face swapping, diffusion
Abstract: Video face swapping has seen increasing adoption in diverse applications, yet existing methods, trained primarily on static images, struggle with temporal consistency and complex real-world scenarios. To overcome these limitations, we propose VividFace, the first robust, high-fidelity diffusion-based framework designed specifically for video face swapping. VividFace employs a novel hybrid training strategy that leverages abundant static image data alongside temporal video sequences, enabling it to effectively model temporal coherence and identity consistency in videos. Central to our approach is a carefully designed diffusion model integrated with a specialized VAE, capable of processing image-video hybrid data efficiently. To further enhance identity and pose disentanglement, we introduce and release the Attribute-Identity Disentanglement Triplet (AIDT) dataset, a large-scale collection of triplets in which each set contains three face images: two sharing the same pose and two sharing the same identity. Augmented comprehensively with occlusion scenarios, AIDT significantly boosts the robustness of VividFace against occlusions. Moreover, we incorporate advanced 3D reconstruction techniques as conditioning inputs to handle significant pose variations effectively. Extensive experiments demonstrate that VividFace achieves state-of-the-art performance in identity preservation, temporal consistency, and visual realism, surpassing existing methods while requiring fewer inference steps. Our framework notably mitigates common challenges such as temporal flickering, identity loss, and sensitivity to occlusions and pose variations. The AIDT dataset, source code, and pre-trained weights are available on the [project page](https://hao-shao.com/projects/vividface.html).
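The AIDT triplet constraint described above (each set contains three face images, two sharing a pose and two sharing an identity) can be sketched as a small data structure. This is a minimal illustration only; the class, field, and label names are assumptions for exposition and are not taken from the paper or its released code.

```python
from dataclasses import dataclass

@dataclass
class AIDTTriplet:
    """One AIDT-style triplet (illustrative sketch, not the official format).

    anchor        -- a source face image
    same_pose     -- different identity, same pose as the anchor
    same_identity -- same identity as the anchor, different pose
    """
    anchor: str
    same_pose: str
    same_identity: str

def check_triplet(pose_labels: dict, identity_labels: dict, t: AIDTTriplet) -> bool:
    """Verify the disentanglement constraint: the anchor shares its pose
    with one image and its identity with the other."""
    shares_pose = pose_labels[t.anchor] == pose_labels[t.same_pose]
    shares_identity = identity_labels[t.anchor] == identity_labels[t.same_identity]
    return shares_pose and shares_identity
```

A sampler built on such triplets lets a training loop contrast pose cues (anchor vs. same-pose image) against identity cues (anchor vs. same-identity image), which is the disentanglement signal the abstract attributes to AIDT.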
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 22880