SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Talking Face Generation (TFG) reconstructs lip-related facial motion from speech input, aiming to produce high-quality, synchronized, and lip-readable videos. Previous efforts have achieved strong visual quality and synchronization, and recently there has been increasing focus on the importance of intelligibility. Despite these efforts, balancing quality, synchronization, and intelligibility remains challenging, often forcing trade-offs that compromise one aspect in favor of another. In light of this, we propose SyncTalklip, a novel dual-tower framework designed to overcome the challenges of synchronization while improving lip-reading performance. To enhance SyncTalklip's performance in both synchronization and intelligibility, we design AV-SyncNet, a pre-trained multi-task model that jointly targets synchronization and intelligibility. Moreover, we propose a novel cross-modal contrastive learning objective that pulls audio and video representations closer to enhance synchronization. Experimental results demonstrate that SyncTalklip achieves state-of-the-art performance in quality, intelligibility, and synchronization, and extensive experiments demonstrate our model's generalizability across domains. The code and demo are available at \url{https://sync-talklip.github.io}.
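The abstract does not spell out the contrastive objective, so the snippet below is a minimal PyTorch sketch of a symmetric InfoNCE-style cross-modal loss of the kind described, assuming batch-aligned audio/video clip embeddings where row i of each tensor comes from the same time window. The function name av_contrastive_loss, the embedding dimension, and the temperature are illustrative assumptions, not the paper's implementation.

    # Illustrative sketch of cross-modal (audio-video) contrastive learning;
    # names and hyperparameters are assumptions, not the paper's code.
    import torch
    import torch.nn.functional as F

    def av_contrastive_loss(audio_emb: torch.Tensor,
                            video_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
        """InfoNCE-style loss pulling time-aligned audio/video pairs together.

        audio_emb, video_emb: (batch, dim) embeddings; row i of each tensor is
        assumed to come from the same clip (positive pair), while all other
        rows in the batch serve as negatives.
        """
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        logits = a @ v.t() / temperature  # (batch, batch) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric loss: match audio-to-video and video-to-audio
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Usage with dummy embeddings
    if __name__ == "__main__":
        a = torch.randn(8, 256)
        v = torch.randn(8, 256)
        print(av_contrastive_loss(a, v).item())

Minimizing such a loss pushes each audio embedding toward its time-aligned video embedding and away from mismatched clips in the batch, which is one standard way to realize the "bringing audio and video closer" synchronization objective the abstract mentions.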
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: We work on the Talking Face Generation (TFG) task. We propose SyncTalklip, a novel framework designed to overcome the challenges of synchronization while improving lip-reading performance. To enhance SyncTalklip's performance in both synchronization and intelligibility, we propose SyncAvhubert, a pretrained model that encodes semantically aligned embeddings from the audio and video modalities; SyncAvhubert can be thought of as defining a customized distance space for SyncTalklip. Moreover, we propose a novel cross-modal contrastive learning objective that pulls audio and video representations closer to enhance synchronization. SyncTalklip achieves state-of-the-art (SOTA) performance in quality, intelligibility, and synchronization.
Supplementary Material: zip
Submission Number: 3679