Abstract: Creating expressive talking heads is crucial for multimedia applications involving virtual humans. Existing approaches predominantly rely on facial landmarks to convey emotional changes. However, these sparse spatial keypoints struggle to capture subtle emotional nuances due to their limited spatial coverage, which degrades accuracy and visual quality, particularly in emotion representation. To address this issue, we propose Emotional Talking with Action Unit (ETAU), a novel method that incorporates facial Action Units (AUs) to generate talking head videos that accurately portray the target emotions. Unlike previous works, ETAU comprehensively quantifies facial expressions through Action Units, providing a detailed and dynamic representation of emotion. To the best of our knowledge, this work pioneers the integration of Action Units into emotional talking head generation. Extensive evaluations on the MEAD dataset demonstrate ETAU's state-of-the-art performance, with a PSNR of 21.89 and an SSIM of 0.68. Critically, ETAU significantly improves the emotion accuracy of the generated results, reaching 84% and confirming its effectiveness in representing emotional expressions.