From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

Xu He; Haoxian Zhang; Hejia Chen; Changyuan Zheng; Liyang Chen; Songlin Tang; Jiehui Huang; Xiaoqiang Liu; Pengfei Wan; Zhiyong Wu

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | CC BY 4.0

TL;DR: We unlock robust mask-free visual dubbing (video lip sync) via a generative bootstrapping framework by learning from generated pseudo-paired data.

Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech but is fundamentally challenged by the lack of ideal training data: paired videos differing only in lip motion. Existing methods circumvent this via mask-based inpainting. However, masking inevitably destroys spatiotemporal context, leading to identity drift and poor robustness (e.g., to occlusions), while also inducing lip-shape leakage that degrades lip sync. To bridge this gap, we propose X-Dub, a novel two-stage generative bootstrapping framework leveraging powerful Diffusion Transformers to unlock mask-free dubbing. Our core insight is to repurpose a mask-based inpainting model exclusively as a dedicated data generator to synthesize scalable, high-fidelity pseudo-paired data, which is subsequently utilized to train and bootstrap a robust, mask-free editing model as the final video dubber. The final dubber is liberated from masking artifacts and leverages the complete video input for high-fidelity inference. We further introduce timestep-adaptive multi-phase learning to disentangle conflicting objectives (structure, lip motion, and texture) across diffusion phases, facilitating stable convergence and advanced editing quality. Additionally, we present X-DubBench, a benchmark for diverse scenarios. Extensive experiments demonstrate that our method achieves state-of-the-art performance with superior lip sync, visual quality, and robustness. Code, demos, and additional resources are available at https://github.com/KlingAIResearch/X-Dub.
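The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the frame representation, the function names (`toy_inpaint_generator`, `build_pseudo_pair`, `phase_for_timestep`), the direction of the pair (generated video as input, real video as target), the ordering of phases over noise levels, and the thresholds 0.7 and 0.3 are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy frame: (everything except the lips, lip shape). Purely illustrative.
Frame = Tuple[str, str]


@dataclass(frozen=True)
class PseudoPair:
    edit_input: List[Frame]  # generated companion video (assumed editor input)
    audio: List[str]         # audio matching the real video
    target: List[Frame]      # real video (assumed training target)


def toy_inpaint_generator(video: List[Frame], other_audio: List[str]) -> List[Frame]:
    """Stand-in for the stage-1 mask-based inpainting model.

    Keeps each frame's context and swaps in the lip shape dictated by a
    different audio track, so the result differs from the input only in lips.
    """
    return [(context, lip) for (context, _), lip in zip(video, other_audio)]


def build_pseudo_pair(real_video: List[Frame], real_audio: List[str],
                      other_audio: List[str]) -> PseudoPair:
    """Stage 1 output: a pseudo-pair on which a mask-free editor could be trained."""
    companion = toy_inpaint_generator(real_video, other_audio)
    return PseudoPair(edit_input=companion, audio=real_audio, target=real_video)


def phase_for_timestep(t: float, structure_above: float = 0.7,
                       texture_below: float = 0.3) -> str:
    """Map a normalized diffusion timestep (1.0 = pure noise) to a training phase.

    The phase order and both thresholds are hypothetical placeholders.
    """
    if t >= structure_above:
        return "structure"
    if t < texture_below:
        return "texture"
    return "lip_motion"


real_video = [("scene0", "a"), ("scene1", "o")]
pair = build_pseudo_pair(real_video, real_audio=["a", "o"], other_audio=["m", "e"])
```

In this sketch the two videos in `pair` share every frame's context and differ only in lip shape, which is the property the abstract identifies as missing from real training data.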

Lay Summary: Imagine taking a video of almost any character and making it speak or sing with new audio, while keeping its identity, style, pose, and scene unchanged. This is the goal of visual dubbing, but existing methods often rely on masking and regenerating the mouth region, which can cause artifacts and failures in challenging cases such as occlusions, stylized characters, non-human faces, or changing lighting. We introduce X-Dub, a two-stage system that first creates useful training pairs from ordinary videos and then learns a mask-free way to edit the video directly. This allows X-Dub to synchronize diverse video characters with new speech or singing audio while preserving the original visual details. Our experiments show stronger lip synchronization, visual quality, identity preservation, and robustness than prior methods.

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/KlingAIResearch/X-Dub

Primary Area: Applications->Computer Vision

Keywords: visual dubbing, lip synchronization, video editing, diffusion transformers

Originally Submitted PDF: pdf

Submission Number: 22928
