Keywords: Visual dubbing, Diffusion Transformers, Contextual learning
Abstract: Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of real-world paired training data. Existing methods circumvent this with a mask-based inpainting paradigm, where the incomplete context forces models to simultaneously hallucinate missing content (e.g., occluded regions) and synchronize lip movements, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping paradigm that reframes visual dubbing from an under-specified inpainting task into a well-conditioned video-to-video editing problem. Our approach uses a Diffusion Transformer to first generate its own ideal training data: a lip-altered companion video for each sample, forming a context-rich pair with the original. An editor is then trained on these pairs, leveraging the complete and aligned video context to focus solely on precise, audio-driven lip modifications. This context-rich conditioning allows our method to achieve state-of-the-art performance, yielding highly accurate lip sync, faithful identity preservation, and exceptional robustness in challenging in-the-wild scenarios such as occlusions and dynamic lighting. We further introduce a timestep-adaptive multi-phase learning strategy that aligns diffusion stages with visual hierarchies, significantly enhancing contextual learning and dubbing quality. Additionally, we propose ContextDubBench, a comprehensive benchmark for robust evaluation across diverse and challenging practical application scenarios.
Our visualizations are available on the anonymous project page x-dub-lab.github.io, and code will be released to benefit the community.
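To make the self-bootstrapping idea in the abstract concrete, below is a minimal, hypothetical sketch of the two-stage training loop it describes: a pretrained generator first synthesizes a lip-altered companion clip for each real sample, and an editor is then supervised on the resulting fully observed (companion, original) pair. All module names, tensor shapes, and the plain regression loss are illustrative assumptions, not the authors' implementation or their diffusion objective.

```python
# Hypothetical sketch of the self-bootstrapping data generation + editor training
# described in the abstract. Shapes, modules, and the loss are assumptions.
import torch
import torch.nn as nn

class TinyDubbingNet(nn.Module):
    """Toy stand-in for a Diffusion Transformer mapping (video, audio) -> video."""
    def __init__(self, video_dim=64, audio_dim=32):
        super().__init__()
        self.fuse = nn.Linear(video_dim + audio_dim, video_dim)

    def forward(self, video, audio):
        # video: (B, T, video_dim), audio: (B, T, audio_dim)
        return self.fuse(torch.cat([video, audio], dim=-1))

generator = TinyDubbingNet()   # assumed pretrained; frozen during editor training
editor = TinyDubbingNet()
opt = torch.optim.Adam(editor.parameters(), lr=1e-4)

for step in range(100):
    original_video = torch.randn(4, 16, 64)  # placeholder: features of a real clip
    original_audio = torch.randn(4, 16, 32)  # placeholder: speech matching that clip
    alt_audio = torch.randn(4, 16, 32)       # placeholder: different driving speech

    # Stage 1 (assumed): generate a lip-altered companion video for this sample,
    # forming a complete, aligned pair with the original.
    with torch.no_grad():
        companion_video = generator(original_video, alt_audio)

    # Stage 2 (assumed): given the fully observed companion as context plus the
    # original audio, the editor is supervised to recover the original clip,
    # so it only has to learn precise audio-driven lip modifications.
    edited = editor(companion_video, original_audio)
    loss = nn.functional.mse_loss(edited, original_video)

    opt.zero_grad()
    loss.backward()
    opt.step()
```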
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8838