TL;DR: We enable robust mask-free visual dubbing (video lip sync) with a generative bootstrapping framework that learns from generated pseudo-paired data.
Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech but is fundamentally challenged by the lack of ideal training data: paired videos differing only in lip motion.
Existing methods circumvent this via mask-based inpainting. However, masking inevitably destroys spatiotemporal context, leading to identity drift and poor robustness (e.g., to occlusions), while also inducing lip-shape leakage that degrades lip sync.
To bridge this gap, we propose X-Dub, a novel two-stage generative bootstrapping framework leveraging powerful Diffusion Transformers to unlock mask-free dubbing.
Our core insight is to repurpose a mask-based inpainting model purely as a data generator that synthesizes scalable, high-fidelity pseudo-paired data, which is then used to train a robust, mask-free editing model that serves as the final video dubber.
The final dubber is liberated from masking artifacts and leverages the complete video input for high-fidelity inference.
We further introduce timestep-adaptive multi-phase learning to disentangle conflicting objectives (structure, lip motion, and texture) across diffusion phases, which stabilizes convergence and improves editing quality.
Additionally, we present X-DubBench, a benchmark covering diverse dubbing scenarios.
Extensive experiments demonstrate that our method achieves state-of-the-art performance with superior lip sync, visual quality, and robustness.
Code, demos, and additional resources are available at https://github.com/KlingAIResearch/X-Dub.
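The idea of assigning different objectives to different diffusion phases can be illustrated with a minimal sketch. This is not the X-Dub implementation: the phase boundaries, the hard (non-overlapping) switching, the ordering of phases along the noise schedule, and all names below (`STRUCTURE_MIN_T`, `LIP_MIN_T`, `phase_for_timestep`, `combined_loss`) are hypothetical, chosen only to show how a normalized timestep might select which of the three objectives named in the abstract is active.

```python
from dataclasses import dataclass

# Hypothetical phase boundaries on a normalized timestep t in [0, 1],
# where t = 1 is treated as the noisiest step. Illustrative values only.
STRUCTURE_MIN_T = 0.7
LIP_MIN_T = 0.3


@dataclass(frozen=True)
class PhaseWeights:
    structure: float
    lip_motion: float
    texture: float


def phase_for_timestep(t: float) -> str:
    """Map a normalized timestep to one of three assumed phases."""
    if not 0.0 <= t <= 1.0:
        raise ValueError(f"timestep must be in [0, 1], got {t}")
    if t >= STRUCTURE_MIN_T:
        return "structure"   # assumed: high noise, coarse layout
    if t >= LIP_MIN_T:
        return "lip_motion"  # assumed: mid noise, mouth motion
    return "texture"         # assumed: low noise, fine detail


def weights_for_timestep(t: float) -> PhaseWeights:
    """Activate exactly one objective per phase (hard switching, assumed)."""
    phase = phase_for_timestep(t)
    return PhaseWeights(
        structure=1.0 if phase == "structure" else 0.0,
        lip_motion=1.0 if phase == "lip_motion" else 0.0,
        texture=1.0 if phase == "texture" else 0.0,
    )


def combined_loss(t: float, structure_loss: float,
                  lip_loss: float, texture_loss: float) -> float:
    """Weight the three per-objective losses by the phase active at t."""
    w = weights_for_timestep(t)
    return (w.structure * structure_loss
            + w.lip_motion * lip_loss
            + w.texture * texture_loss)
```

In this sketch only one loss contributes at any timestep, so the three objectives never compete within a single training step; a real system might instead blend weights smoothly near the boundaries.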
Lay Summary: Imagine taking a video of almost any character and making it speak or sing with new audio, while keeping its identity, style, pose, and scene unchanged. This is the goal of visual dubbing, but existing methods often rely on masking and regenerating the mouth region, which can cause artifacts and failures in challenging cases such as occlusions, stylized characters, non-human faces, or changing lighting. We introduce X-Dub, a two-stage system that first creates useful training pairs from ordinary videos and then learns a mask-free way to edit the video directly. This allows X-Dub to synchronize diverse video characters with new speech or singing audio while preserving the original visual details. Our experiments show stronger lip synchronization, visual quality, identity preservation, and robustness than prior methods.
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/KlingAIResearch/X-Dub
Primary Area: Applications->Computer Vision
Keywords: visual dubbing, lip synchronization, video editing, diffusion transformers
Originally Submitted PDF: pdf
Submission Number: 22928