Abstract: Speech-driven 3D facial animation aims to synthesize 3D talking head animations with precise lip movements and rich stylistic expressions. However, existing methods exhibit two limitations: 1) they mostly focus on emotionless facial animation modeling and neglect the importance of emotional expression, owing to the lack of high-quality 3D emotional talking head datasets, and 2) several recent works treat emotional intensity as a global controllable parameter, akin to emotion or speaker style, leading to over-smoothed emotional expressions in their results. To address these challenges, we first collect a 3D talking head dataset comprising five emotional styles, represented by a set of coefficients based on the MetaHuman character model, and then propose an end-to-end deep neural network, DEITalk, which conditions on speech and emotional style labels to generate realistic facial animation with dynamic expressions. To model emotional saliency variations in long-term audio contexts, we design a dynamic emotional intensity (DEI) modeling module and a dynamic positional encoding (DPE) strategy. The former extracts implicit representations of emotional intensity from speech features and utilizes them as local (high temporal frequency) emotional supervision, whereas the latter provides the ability to generalize to longer speech sequences. Moreover, we introduce an emotion-guided feature fusion decoder and a four-way loss function to generate emotion-enhanced 3D facial animation with controllable emotional styles. Extensive experimental results demonstrate that our method outperforms existing state-of-the-art methods. We recommend watching the video demo provided in our supplementary material for detailed results.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: Speech-driven 3D facial animation (audio-to-3Dface), which aims to synthesize realistic facial movements of 3D characters from arbitrary speech input, has recently garnered significant attention from both academia and industry. However, existing audio-to-3Dface research suffers from limited emotional expressiveness, due to the lack of high-quality emotional 3D talking head datasets or inappropriate modeling of emotional intensity. To overcome these limitations, we create an audiovisual dataset of emotional 3D talking heads and propose an end-to-end deep neural network, DEITalk, that takes audio and emotional style labels as input to generate realistic 3D facial animation with dynamic expressions. To the best of our knowledge, DEITalk is the first attempt to model dynamic emotional intensity by learning a speech-expression joint embedding space involving two modalities: audio and video. In addition, DEITalk is a lightweight model capable of generating facial animation in real time; when combined with large language models and text-to-speech technologies, it can be employed in virtual reality applications to offer users immersive interactions with 3D virtual characters. In short, our work aims to advance multimedia technology, laying the groundwork for more intelligent and interactive multimedia applications.
Supplementary Material: zip
Submission Number: 3546