Speech-driven 3D facial animation aims to synthesize 3D talking head animations with precise lip movements and rich stylistic expressions. However, existing methods exhibit two limitations: 1) most focus on emotionless facial animation modeling and neglect the importance of emotional expression, owing to the lack of high-quality 3D emotional talking head datasets, and 2) several recent works treat emotional intensity as a global controllable parameter, akin to emotional or speaker style, which leads to over-smoothed emotional expressions in their results. To address these challenges, we first collect a 3D talking head dataset comprising five emotional styles, with coefficients based on the MetaHuman character model, and then propose an end-to-end deep neural network, DEITalk, which is conditioned on speech and emotional style labels to generate realistic facial animation with dynamic expressions. To model emotional saliency variations over long-term audio contexts, we design a dynamic emotional intensity (DEI) modeling module and a dynamic positional encoding (DPE) strategy. The former extracts implicit representations of emotional intensity from speech features and uses them as local (high temporal frequency) emotional supervision, whereas the latter enables generalization to longer speech sequences. Moreover, we introduce an emotion-guided feature fusion decoder and a four-way loss function to generate emotion-enhanced 3D facial animation with controllable emotional styles. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art methods. We recommend watching the video demo provided in our supplementary material for detailed results.
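To make the conditioning interface concrete, below is a minimal, hypothetical PyTorch sketch of a model that maps per-frame speech features plus a discrete emotional style label to per-frame facial coefficients. It is not the authors' DEITalk implementation: the module names, dimensions, the plain Transformer backbone, and the standard sinusoidal positional encoding (used here in place of the paper's DPE, DEI module, and emotion-guided fusion decoder) are all placeholders chosen for illustration.

```python
# Hypothetical sketch only; names and dimensions are assumptions, not the paper's.
import math
import torch
import torch.nn as nn


class SinusoidalPE(nn.Module):
    """Standard sinusoidal positional encoding (stand-in for the paper's DPE)."""
    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        return x + self.pe[: x.size(1)]


class TalkingHeadSketch(nn.Module):
    """Speech features + emotional style label -> per-frame facial coefficients."""
    def __init__(self, d_audio=768, d_model=256, d_coeff=64, n_emotions=5):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.emotion_emb = nn.Embedding(n_emotions, d_model)  # global style label
        self.pos_enc = SinusoidalPE(d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, d_coeff)  # d_coeff is a placeholder output size

    def forward(self, audio_feats, emotion_id):
        # audio_feats: (B, T, d_audio); emotion_id: (B,) integer style label
        x = self.audio_proj(audio_feats)
        x = x + self.emotion_emb(emotion_id).unsqueeze(1)  # broadcast style over time
        x = self.pos_enc(x)
        x = self.backbone(x)
        return self.head(x)  # (B, T, d_coeff)


# Usage: 2 clips, 100 frames of 768-d speech features, emotion labels in {0..4}.
model = TalkingHeadSketch()
coeffs = model(torch.randn(2, 100, 768), torch.tensor([0, 3]))
print(coeffs.shape)  # torch.Size([2, 100, 64])
```

Note that this sketch conditions on emotion only as a single global embedding; the abstract's point is precisely that DEITalk goes further by deriving a per-frame (high temporal frequency) emotional intensity signal from the speech itself and using it as local supervision.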