Abstract: This paper proposes GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a single 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where they are merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial information of the head and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This approach is more stable than previous concatenation- or multiplication-based schemes for manipulating the numerous Gaussians and their intricate parameters. Overall, GaussianTalker offers a promising approach for real-time generation of high-quality pose-controllable talking heads.
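For intuition, the spatial-audio attention step described above can be pictured as a cross-attention block in which per-Gaussian spatial features act as queries over audio features, and the fused result is decoded into per-attribute offsets. The following is a minimal sketch under assumed names and dimensions (`SpatialAudioAttention`, `feat_dim`, `audio_dim` are all illustrative), not the paper's actual implementation:

```python
# Hedged sketch (not the authors' code): cross-attention where per-Gaussian
# features query audio features to predict frame-wise attribute offsets.
# All module names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialAudioAttention(nn.Module):
    def __init__(self, feat_dim=64, audio_dim=64, num_heads=4):
        super().__init__()
        # Cross-attention: Gaussian features (queries) attend to audio features (keys/values).
        self.attn = nn.MultiheadAttention(feat_dim, num_heads,
                                          kdim=audio_dim, vdim=audio_dim,
                                          batch_first=True)
        # One head predicting offsets for each Gaussian attribute:
        # position (3), rotation quaternion (4), scale (3), opacity (1).
        self.to_offsets = nn.Linear(feat_dim, 3 + 4 + 3 + 1)

    def forward(self, gauss_feats, audio_feats):
        # gauss_feats: (B, N, feat_dim) shared implicit features sampled per Gaussian
        # audio_feats: (B, T, audio_dim) audio features for the current frame window
        fused, _ = self.attn(gauss_feats, audio_feats, audio_feats)
        offsets = self.to_offsets(fused)  # (B, N, 11)
        d_pos, d_rot, d_scale, d_opac = torch.split(offsets, [3, 4, 3, 1], dim=-1)
        return d_pos, d_rot, d_scale, d_opac
```

Compared with concatenating or multiplying audio features into each Gaussian's parameters directly, attention lets every Gaussian weight the audio signal by its learned spatial context, which is consistent with the stability claim in the abstract.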
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This work contributes to multimedia processing by providing a superior method for synthesizing 3D talking portraits, thereby modeling the relationship between 3D vision and speech audio. Our method improves the audio-visual mapping by disentangling speech-related motion from a talking portrait video, enhancing the generation and manipulation of visual and audio media. With the improved fidelity and rapid rendering speed that GaussianTalker offers, we present a breakthrough in synthesizing and dynamically manipulating multimedia data.
Supplementary Material: zip
Submission Number: 5131