Keywords: Talking-head, 3D Gaussian Splatting, One-shot
TL;DR: An end-to-end one-shot 3DGS-based framework supporting talking-head generation.
Abstract: One-shot 3D talking-head synthesis aims to generate realistic 3D facial animations from a single portrait image, driven by audio or video inputs. While recent advances in 3D-aware generation, particularly 3D Gaussian Splatting, have enabled high-fidelity modeling and real-time rendering, existing methods still struggle with two critical challenges: (i) accurately preserving identity without multi-view supervision, and (ii) producing temporally coherent animations free from jitter. We propose GenFaceTalk, a novel end-to-end one-shot 3DGS-based framework supporting both audio- and video-driven scenarios without subject-specific training. The core insight of GenFaceTalk is to directly predict motion-disentangled FLAME parameters from the driving video, distilling the functionality of pre-trained 3D face reconstruction and sliding-window temporal smoothing into the encoder during training. This design removes the need for face reconstruction at inference, yielding temporally consistent animation while preserving identity and fine-grained facial details. We further introduce a joint learning strategy that integrates FLAME-based motion priors with hierarchical appearance features from the source image, guiding 3DGS learning in a spatially aligned and identity-aware manner.
Our framework generalizes across diverse facial styles, including artistic and animal faces.
Experiments demonstrate that GenFaceTalk outperforms state-of-the-art baselines in visual fidelity, temporal stability, identity preservation, and cross-domain generalization.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7809