Abstract: Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still face significant challenges from uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture, and the lack of fine-grained control over facial expressions. To introduce face-guided conditions beyond the speech audio clip, we propose Playmate, a novel two-stage training framework for generating more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation together with a carefully designed motion-decoupled module to achieve more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. In the second stage, we introduce an emotion-control module that encodes emotion information into the latent space, enabling fine-grained control over emotions and thereby allowing the generation of talking videos with the desired emotion. Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in video quality but also remains highly competitive in lip synchronization, while offering improved flexibility in controlling emotion and head pose. The code will be available at https://github.com/Playmate111/Playmate.
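To make the two-stage design concrete, the following is a minimal PyTorch sketch of how such a pipeline might be wired: a first stage that maps audio features to disentangled motion codes (expression, head pose, lips), followed by a second stage that injects an emotion condition into the latent codes. All module names, dimensions, and the discrete emotion embedding are illustrative assumptions and do not reflect the actual Playmate implementation.

import torch
import torch.nn as nn

class MotionDecoupledStage(nn.Module):
    """Stage 1 (sketch): map audio features to disentangled implicit motion codes."""
    def __init__(self, audio_dim=768, motion_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, motion_dim)
        # Separate heads for decoupled attributes: expression, head pose, lip motion.
        self.expression_head = nn.Linear(motion_dim, motion_dim)
        self.pose_head = nn.Linear(motion_dim, 6)   # e.g., rotation + translation parameters
        self.lip_head = nn.Linear(motion_dim, motion_dim)

    def forward(self, audio_feat):
        h = torch.relu(self.audio_proj(audio_feat))
        return {
            "expression": self.expression_head(h),
            "pose": self.pose_head(h),
            "lip": self.lip_head(h),
        }

class EmotionControlStage(nn.Module):
    """Stage 2 (sketch): fuse an emotion condition into the latent expression code."""
    def __init__(self, motion_dim=256, num_emotions=8):
        super().__init__()
        self.emotion_embed = nn.Embedding(num_emotions, motion_dim)
        self.fuse = nn.Linear(2 * motion_dim, motion_dim)

    def forward(self, motion, emotion_id):
        emo = self.emotion_embed(emotion_id)                        # (B, motion_dim)
        fused = torch.cat([motion["expression"], emo], dim=-1)
        motion = dict(motion)
        motion["expression"] = self.fuse(fused)                     # emotion-modulated expression
        return motion

# Usage sketch: audio features (e.g., from a pretrained speech encoder) in, motion codes out.
stage1 = MotionDecoupledStage()
stage2 = EmotionControlStage()
audio_feat = torch.randn(2, 768)                                    # batch of 2 audio feature vectors
motion = stage2(stage1(audio_feat), emotion_id=torch.tensor([3, 3]))
print({k: v.shape for k, v in motion.items()})

In the actual method these motion codes would condition a diffusion-based generator that renders the final talking-face frames; that rendering stage is omitted here.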
Lay Summary: Recent talking face generation models can create videos that match a given speech audio clip to a specific person, but they still face challenges such as inaccurate lip-sync, unnatural head positions, and limited control over facial expressions. To address these issues, we developed a new method called Playmate, a two-stage framework that produces more realistic facial expressions and talking faces.
In the first stage, Playmate uses a special 3D representation and a motion-decoupled module to better separate and accurately generate facial attributes from audio.
In the second stage, it adds an emotion-control feature, allowing for precise adjustments of emotions in the generated videos.
Tests show that Playmate surpasses current top methods in video quality and remains highly competitive in lip-sync accuracy, while also offering greater flexibility in controlling emotions and head poses.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Playmate111/Playmate
Primary Area: Applications->Computer Vision
Keywords: Diffusion, Portrait Animation, Transformer, Audio-driven
Submission Number: 2460