Make Audio Solely Drive Lip in Talking Face Video Synthesis

Xing Bai, Jun Zhou, Pengyuan Zhang, Ruipeng Hao

Published: 2024, Last Modified: 07 Jan 2026. Venue: ICANN (3) 2024. License: CC BY-SA 4.0

Abstract: In this work, we investigate the problem of synthesizing a talking face video that is synchronized with a target speech segment. Although there has been significant progress on this task, the most successful approaches are still identity-agnostic, and the results they generate cannot capture the unique speaking characteristics of the target individual. In this paper, we propose an effective framework that enables personalized and high-fidelity lip synchronization. The emphasis of our method lies in data preprocessing and inference. Specifically, we find that convolutional neural networks (CNNs) are more inclined to learn lip motions from facial and throat muscle movements than from speech features. Only by masking the regions related to lip-shape changes as much as possible can CNNs associate speech features with lip motions. Given the original head pose information, our model generates realistic images in which the lip motion is driven solely by the features of the input audio signal. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results. The code will be made publicly available.
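The masking step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the mask geometry, so everything here is an assumption made for illustration: the function name `mask_lip_region`, the parameter `top_frac`, and the choice of a simple rectangular mask that blanks every row below a given height of an aligned face crop. The authors' actual preprocessing may differ.

```python
import numpy as np


def mask_lip_region(face: np.ndarray, top_frac: float = 0.5) -> np.ndarray:
    """Return a copy of an (H, W, C) face crop with its lower rows zeroed.

    Hypothetical helper: `top_frac` is the fraction of the image height
    (measured from the top) that is kept visible. Rows below that point,
    which would contain the lips, jaw, and throat in an aligned crop,
    are set to zero so a network cannot read lip shape from them.
    """
    if not 0.0 <= top_frac <= 1.0:
        raise ValueError("top_frac must be in [0, 1]")
    masked = face.copy()                      # leave the input untouched
    start_row = int(face.shape[0] * top_frac)  # first row to hide
    masked[start_row:, :, :] = 0
    return masked


# Usage on a dummy all-white 96x96 RGB frame (illustrative size only).
face = np.full((96, 96, 3), 255, dtype=np.uint8)
masked = mask_lip_region(face, top_frac=0.4)
```

A smaller `top_frac` hides more of the face, which matches the abstract's point that the masked area should cover as much of the lip-related region as possible, at the cost of leaving the generator less visual context to work with.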

External IDs: dblp:conf/icann/BaiZZH24