Dubbing for Everyone: Cost and Data-Efficient Visual Dubbing using Neural Rendering Priors

TMLR Paper3824 Authors

02 Dec 2024 (modified: 24 Feb 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Visual dubbing is the process of generating lip motions for an actor in a video so that they synchronize with given audio, allowing video-based media to reach global audiences. Recent advances have made progress towards this goal. However, existing models are either zero-shot, and therefore lack quality, or are expensive person-specific methods that require off-putting user enrollment and lengthy, costly model training. Our key insight is to train a large, multi-person prior network that can then be adapted to new users. This approach enables high-quality visual dubbing from just a few seconds of data, making video dubbing possible for any actor, from A-list celebrities to background actors, at a much lower cost. We achieve state-of-the-art visual quality and recognizability, demonstrated both quantitatively and qualitatively through two user studies. Our prior-learning and adaptation method generalizes to limited data better than existing person-specific models, and in experiments on real-world, limited-data scenarios our model is preferred over all existing methodologies.
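To make the two-stage recipe described in the abstract concrete, the sketch below (not the authors' code) illustrates the general pattern of pretraining a shared prior across many identities and then fine-tuning it on a few seconds of a new actor's footage. The model, loss, data shapes, and hyperparameters are hypothetical placeholders; the paper's actual prior is a neural renderer, and its adaptation procedure may differ.

```python
# Illustrative sketch only: prior pretraining followed by few-shot adaptation.
import torch
import torch.nn as nn


class LowerFacePrior(nn.Module):
    """Toy stand-in for a multi-person prior: maps an audio feature window to a
    lower-face image patch. The real model is a neural renderer; this MLP only
    illustrates the two-stage training recipe."""

    def __init__(self, audio_dim: int = 80, out_pixels: int = 3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 512), nn.ReLU(),
            nn.Linear(512, out_pixels),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.net(audio_feats)


def pretrain_prior(model, multi_person_loader, epochs=10, lr=1e-4):
    """Stage 1: train the prior on footage of many different identities."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio, target_patch in multi_person_loader:
            loss = nn.functional.l1_loss(model(audio), target_patch)
            opt.zero_grad()
            loss.backward()
            opt.step()


def adapt_to_actor(model, few_second_loader, steps=200, lr=1e-5):
    """Stage 2: fine-tune the pretrained prior on a few seconds of one actor."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    data_iter = iter(few_second_loader)
    for _ in range(steps):
        try:
            audio, target_patch = next(data_iter)
        except StopIteration:  # loop over the tiny adaptation set
            data_iter = iter(few_second_loader)
            audio, target_patch = next(data_iter)
        loss = nn.functional.l1_loss(model(audio), target_patch)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The small learning rate and short adaptation schedule in the sketch reflect the general intuition that a strong prior needs only a light update for a new identity; the paper's actual settings are not reproduced here.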
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: **Changelog**
- We redefine visual dubbing as the process of replacing only the lower-half region of the face in response to the audio signal. We define a person-generic model as one that sees only a single reference frame of the target person, and a person-specific model as one that has been trained or fine-tuned on at least some data of the target person. Consequently, we have removed the comparisons with RaDNeRF and Geneface, as they do not perform visual dubbing. To compensate, we have added more baselines, including DiffDub and MuseTalk (person-generic) as well as person-specific models in the form of fine-tuned versions of DiffDub and TalkLip (labelled with -FT). We have updated figures and tables accordingly. We have also made several changes relating to our claims and presentation, as requested by the reviewers.
- We remove references to non-visual-dubbing works to avoid mischaracterization. We add more detail to the limitations section to reflect that our model does not work well given only a single frame of data. We add a section discussing the sources of several training datasets.
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 3824