ProDub: Progressive Growing of Facial Dubbing Networks for Enhanced Lip Sync and Fidelity

Published: 2024, Last Modified: 05 Nov 2025, ICME 2024, CC BY-SA 4.0
Abstract: Facial dubbing has attracted growing research interest due to its creative and practical applications. An ideal facial dubbing video should exhibit accurate lip-sync and high visual quality. However, prior methods do not fully explore the relationship between two pairs of critical elements: mouth shape and texture, which must be distinguished for accurate lip-sync; and the driving audio and high-frequency details, which must be aligned for better visual quality. To address these challenges, we propose ProDub, a progressive framework for this task. Specifically, we propose an audio-supervised contrastive approach to disentangle mouth shape from texture, along with a novel lip-shape-aware loss that serves as a constraint for producing accurate lip-sync. For high-quality visual output, we design a lip-aware temporal-enhanced network that improves lip details while ensuring temporal coherency based on a learned prior. Extensive experiments demonstrate that ProDub improves lip-sync by 10.7% and visual quality by 28.5% compared to state-of-the-art methods.