Keywords: Talking-head, 3D Gaussian Splatting, One-shot
TL;DR: An end-to-end one-shot 3DGS-based framework supporting talking-head generation.
Abstract: One-shot 3D talking-head synthesis aims to generate realistic 3D facial animations from a single portrait image, driven by audio or video inputs. While recent advances in 3D-aware generation, particularly 3D Gaussian Splatting, have enabled high-fidelity modeling and real-time rendering, existing methods still struggle with two critical challenges: (i) accurately preserving identity without multi-view supervision, and (ii) producing temporally coherent animations free from jitter. We propose GenFaceTalk, a novel end-to-end one-shot 3DGS-based framework supporting both audio- and video-driven scenarios without subject-specific training. The core insight of GenFaceTalk is to directly predict motion-disentangled FLAME parameters from the driving video, distilling the functionality of pre-trained 3D face reconstruction and sliding-window temporal smoothing into the encoder during training. This design removes the need for face reconstruction at inference, yielding temporally consistent animation while preserving identity and fine-grained facial details. We further introduce a joint learning strategy that integrates FLAME-based motion priors with hierarchical appearance features from the source image, guiding 3DGS learning in a spatially aligned and identity-aware manner.
Our framework generalizes across diverse facial styles, including artistic and animal faces.
Experiments demonstrate that GenFaceTalk outperforms state-of-the-art baselines in visual fidelity, temporal stability, identity preservation, and cross-domain generalization.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7809