Abstract: Recent work on audio-driven talking head synthesis using Neural Radiance Fields (NeRF) has achieved impressive results. However, due to the inadequate pose and expression control afforded by NeRF's implicit representation, these methods still suffer from limitations such as unsynchronized or unnatural lip movements, visual jitter, and artifacts. In this paper, we propose GaussianTalker, a novel method for audio-driven talking head synthesis based on 3D Gaussian Splatting. Exploiting the explicit representation of 3D Gaussians, we achieve intuitive control of facial motion by binding Gaussians to 3D facial models. GaussianTalker consists of two modules: a Speaker-specific Motion Translator and a Dynamic Gaussian Renderer. The Speaker-specific Motion Translator achieves accurate lip movements specific to the target speaker through universalized audio feature extraction and customized lip motion generation. The Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes to enhance facial detail representation via a latent pose, delivering stable and realistic rendered videos. Extensive experimental results suggest that GaussianTalker outperforms existing state-of-the-art methods in talking head synthesis, delivering precise lip synchronization and exceptional visual quality. Our method achieves rendering speeds of 130 FPS on an NVIDIA RTX 4090 GPU, significantly exceeding the threshold for real-time rendering, and can potentially be deployed on other hardware platforms.
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: GaussianTalker is a typical multimodal technology that combines three modalities — driving audio, a 3D facial model, and video — to synthesize natural and realistic videos by analyzing the audio signal and generating lip movements that match the speaker. An important goal of multimedia technology is to provide a richer and more natural interactive experience, and GaussianTalker can greatly enhance the user experience by creating a near-realistic audio-visual experience in a variety of application scenarios, such as digital avatars, virtual reality, interactive entertainment, and remote communication.
Supplementary Material: zip
Submission Number: 5519