Abstract: Talking face generation focuses on creating natural facial animations that align with the provided text or audio input. Current methods in this field rely primarily on facial landmarks to convey emotional changes. While such spatial keypoints are valuable, their restricted spatial coverage limits their ability to capture the intricate dynamics and subtle nuances of emotional expressions. Consequently, relying on sparse landmarks can reduce accuracy and visual quality, especially when representing complex emotional states. To address this issue, we propose a novel method called Emotional Talking with Action Unit (ETAU), which integrates facial Action Units (AUs) into the generation process. Unlike previous works that rely solely on facial landmarks, ETAU employs both Action Units and landmarks to represent facial expressions comprehensively through interpretable representations. By capturing the complex interactions among facial muscle movements, our method provides a detailed and dynamic representation of emotions. Moreover, ETAU adopts a multi-modal strategy that integrates emotion prompts, driving videos, and target images; by leveraging these diverse inputs effectively, it generates highly realistic and expressive emotional talking-face videos. Through extensive evaluations across multiple datasets, including MEAD, LRW, GRID, and HDTF, ETAU outperforms previous methods, demonstrating its superior ability to generate high-quality, expressive talking faces with improved visual fidelity and synchronization. In addition, ETAU substantially improves the emotion accuracy of the generated results, reaching an average accuracy of 84% on the MEAD dataset.
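To make the combined AU-and-landmark representation concrete, the sketch below shows one plausible way to fuse per-frame AU intensities with sparse landmark coordinates into a single conditioning vector for a generator. It is a minimal illustration under stated assumptions, not the ETAU implementation: the module name AULandmarkEncoder, the feature dimensions (17 AUs, 68 landmarks, a 256-dimensional embedding), and the concatenation-based fusion are choices made only for this example.

```python
# Minimal sketch (not the authors' implementation): fusing facial Action Unit
# intensities with sparse 2D landmarks into one conditioning embedding.
# All module names, feature dimensions, and the fusion scheme are assumptions.
import torch
import torch.nn as nn


class AULandmarkEncoder(nn.Module):
    """Encodes per-frame AU intensities and 2D landmarks into one embedding."""

    def __init__(self, num_aus: int = 17, num_landmarks: int = 68, dim: int = 256):
        super().__init__()
        # AUs capture muscle-level activations; landmarks capture coarse face geometry.
        self.au_proj = nn.Sequential(nn.Linear(num_aus, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.lmk_proj = nn.Sequential(nn.Linear(num_landmarks * 2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, aus: torch.Tensor, landmarks: torch.Tensor) -> torch.Tensor:
        # aus: (B, num_aus) intensities in [0, 1]
        # landmarks: (B, num_landmarks, 2) normalized image coordinates
        au_feat = self.au_proj(aus)
        lmk_feat = self.lmk_proj(landmarks.flatten(start_dim=1))
        return self.fuse(torch.cat([au_feat, lmk_feat], dim=-1))


if __name__ == "__main__":
    encoder = AULandmarkEncoder()
    aus = torch.rand(4, 17)       # dummy AU intensities for a batch of 4 frames
    lmks = torch.rand(4, 68, 2)   # dummy normalized landmark coordinates
    cond = encoder(aus, lmks)     # (4, 256) conditioning vector for a generator
    print(cond.shape)
```

In such a scheme the resulting vector would condition the video generator alongside the other modalities (emotion prompt, driving video, target image) mentioned in the abstract; how ETAU actually performs this fusion is specified in the paper itself.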