Make you said that: A motion robust multi-knowledge fusion framework for speaker-agnostic visual dubbing

Yilei Chen, Shengwu Xiong

Published: 2025, Last Modified: 30 Jul 2025, Venue: Knowl. Based Syst. 2025, License: CC BY-SA 4.0

Abstract: Speaker-agnostic visual dubbing technology seeks to synchronize lip movements in facial videos with an audio signal, requiring high precision and exceptional audio-visual fidelity. While many existing methods focus on audio-visual synchronization, they often rely on generative adversarial networks (GANs) to inpaint cropped facial areas based on audio cues. However, these methods can result in unrealistic facial textures or noticeable artifacts, especially when videos contain natural head movements. To overcome these challenges, we propose a novel framework that utilizes the 3D Morphable Model (3DMM) as an intermediate representation, decomposing the visual dubbing task into two independent sub-tasks: audio-driven 3D expression prediction and 3D face-guided neural rendering. Our framework introduces an innovative audio-visual synchronization network guided by knowledge priors, significantly improving synchronization quality. We also propose a Multi-facial Prior Fusion Texture Enhanced Render Network, which ensures texture consistency across facial regions and enhances robustness to head movements. By employing a multi-task learning framework, our method maximizes the use of reference image textures, significantly improving the realism of generated talking face videos. Extensive experiments demonstrate that our framework outperforms state-of-the-art techniques and sets a new benchmark for speaker-agnostic visual dubbing.
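Illustrative sketch (not from the paper): the abstract names the 3DMM as the intermediate representation but gives no formulas. The snippet below assumes the standard linear 3DMM formulation, in which a face shape is a mean shape plus identity and expression offsets, and shows why that representation lets expression be replaced while identity is kept. All names (`reconstruct_shape`, `dub_frame`, `beta_from_audio`) and the toy bases are hypothetical. The audio-to-expression predictor and the neural renderer described in the abstract are not modelled here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # one row per output coordinate


def mat_vec(basis: Matrix, coeffs: Vector) -> Vector:
    """Multiply a (rows x k) basis by a length-k coefficient vector."""
    return [sum(b * c for b, c in zip(row, coeffs)) for row in basis]


def reconstruct_shape(mean: Vector, id_basis: Matrix, exp_basis: Matrix,
                      alpha: Vector, beta: Vector) -> Vector:
    """Standard linear 3DMM: shape = mean + id_basis @ alpha + exp_basis @ beta."""
    id_offset = mat_vec(id_basis, alpha)
    exp_offset = mat_vec(exp_basis, beta)
    return [m + i + e for m, i, e in zip(mean, id_offset, exp_offset)]


def dub_frame(mean: Vector, id_basis: Matrix, exp_basis: Matrix,
              alpha: Vector, beta_from_audio: Vector) -> Vector:
    """Keep the speaker's identity coefficients and swap in expression
    coefficients that (by assumption) come from an audio-driven predictor."""
    return reconstruct_shape(mean, id_basis, exp_basis, alpha, beta_from_audio)


# Toy model (assumption): one vertex with three coordinates,
# one identity component and one expression component.
mean = [0.0, 0.0, 0.0]
id_basis = [[1.0], [0.0], [0.0]]
exp_basis = [[0.0], [1.0], [0.0]]
alpha = [2.0]

original = reconstruct_shape(mean, id_basis, exp_basis, alpha, [0.5])
dubbed = dub_frame(mean, id_basis, exp_basis, alpha, [-1.0])
```

In this toy setup only the coordinate driven by the expression basis changes between `original` and `dubbed`, while the identity-driven coordinate stays fixed. That separation is what allows the two sub-tasks named in the abstract to be handled independently.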

External IDs: dblp:journals/kbs/ChenX25