Abstract: The aim of person-generic Talking Face Generation (TFG) is to reconstruct realistic facial motions for arbitrary speakers that are consistent with a given speech signal. Previous generation methods struggle to focus on the content of lip movements. Although introducing an additional pre-trained lip-reading model can address this problem, it often degrades visual quality, which diminishes the value of the improvement. To address this issue, we present MouthMotion, a framework that incorporates a novel textual branch to enhance visual feature extraction and compels motion learning to focus specifically on the mouth region. Guided by a carefully designed mouth-related text prompt, MouthMotion employs a mouth motion learning module based on Contrastive Language-Image Pre-training (CLIP) to learn mouth motions across face frames, supervised by a cosine similarity loss. To effectively fuse motion, face, and speech latent codes within a joint learning space, we propose a motion-face learning module and a motion-speech learning module. We evaluate MouthMotion on the LRS2 and LRW datasets in terms of visual quality (PSNR, SSIM), lip sync (LSE-C, LSE-D), and lip-reading intelligibility (WER, ACC) to validate its mouth motion capturing capability. Extensive qualitative and quantitative experiments demonstrate the superiority of our proposed method over other state-of-the-art methods.
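The abstract only names the cosine similarity supervision at a high level; as a rough, hypothetical sketch (not the authors' implementation), the loss between CLIP features of mouth-region frames and the mouth-related prompt could take the following form, where `image_embeds` and `text_embed` are assumed to be precomputed CLIP image and text features.

```python
import torch
import torch.nn.functional as F

def mouth_motion_cosine_loss(image_embeds: torch.Tensor,
                             text_embed: torch.Tensor) -> torch.Tensor:
    """Hypothetical cosine-similarity supervision.

    image_embeds: (B, D) CLIP image features of cropped mouth frames
    text_embed:   (D,)   CLIP text feature of the mouth-related prompt
    Returns a scalar loss that is minimized when the two align.
    """
    # Project both onto the unit sphere, as CLIP does before similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)
    # 1 - cosine similarity, averaged over the batch of frames.
    return (1.0 - image_embeds @ text_embed).mean()
```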