XTalker: Turn, Smile, and Speak in Controllable Talking Portrait Animation

10 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Portrait Animation, Parameter Representation, Controllability
Abstract: Audio-driven portrait animation produces lifelike talking faces from a static image and an audio input, and has become a key task for creating realistic digital humans. Existing methods mainly follow either latent- or parameter-based paradigms, yet they suffer from limited controllability, high computational cost, or a lack of broader expressivity (emotion and head-motion diversity) beyond lip synchronization. To address these limitations, we present qualitative and quantitative analyses of parameter representations, focusing on the disentanglement of head keypoints, and find that the facial representation can be decomposed into three interpretable subspaces: lip-phoneme synchronization governed by audio envelope dynamics, emotion modulation conditioned on semantic features and reference labels, and head motion controlled by user-defined curves. Building on this insight, we propose XTalker, a fast and controllable generation framework based on flow matching. It employs a unified MM-DiT backbone to jointly encode portrait and audio signals, followed by three lightweight heads: a talking head for accurate lip movement driven by the audio envelope, an emotion head for synthesizing facial expressions from emotion labels, and a pose head for animating diverse head trajectories. Extensive experiments show that XTalker runs in real time and achieves competitive lip-sync accuracy while improving emotion and motion expressivity, making it well suited for controllable portrait animation applications.
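To make the described decomposition concrete, below is a minimal, hypothetical PyTorch sketch of the shared-backbone, three-head design named in the abstract. All module names, feature dimensions, the pooling step, and the simple token-concatenation stand-in for multimodal joint attention are illustrative assumptions; the paper's actual MM-DiT backbone and flow-matching training objective are not specified here.

```python
# Hypothetical sketch only: a shared backbone over portrait/audio tokens,
# followed by three lightweight heads (lip, emotion, pose). Dimensions and
# encoder interfaces are assumed for illustration.
import torch
import torch.nn as nn


class XTalkerSketch(nn.Module):
    """Joint portrait/audio backbone with lip, emotion, and pose heads."""

    def __init__(self, d_model=256, n_layers=4, n_heads=4,
                 lip_dim=20, expr_dim=64, pose_dim=6):
        super().__init__()
        # Stand-ins for the portrait/audio feature projections (assumed).
        self.portrait_proj = nn.Linear(512, d_model)   # portrait features
        self.audio_proj = nn.Linear(128, d_model)      # audio features
        self.time_proj = nn.Linear(1, d_model)         # flow-matching time t
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Three lightweight heads over the shared representation.
        self.lip_head = nn.Linear(d_model, lip_dim)       # lip-phoneme params
        self.emotion_head = nn.Linear(d_model, expr_dim)  # expression params
        self.pose_head = nn.Linear(d_model, pose_dim)     # head-pose params

    def forward(self, portrait_feat, audio_feat, t):
        # Concatenate portrait, audio, and time tokens into one sequence:
        # a crude simplification of joint multimodal attention in MM-DiT.
        tokens = torch.cat([
            self.portrait_proj(portrait_feat),
            self.audio_proj(audio_feat),
            self.time_proj(t.unsqueeze(-1)),
        ], dim=1)
        h = self.backbone(tokens)
        pooled = h.mean(dim=1)  # pool to a per-frame summary vector
        return (self.lip_head(pooled),
                self.emotion_head(pooled),
                self.pose_head(pooled))


if __name__ == "__main__":
    model = XTalkerSketch()
    portrait = torch.randn(2, 1, 512)   # batch of 2, one portrait token each
    audio = torch.randn(2, 10, 128)     # 10 audio frames
    t = torch.rand(2, 1)                # flow-matching timestep in [0, 1]
    lip, emo, pose = model(portrait, audio, t)
    print(lip.shape, emo.shape, pose.shape)  # (2, 20) (2, 64) (2, 6)
```

The separation into three small heads mirrors the abstract's claim that lip sync, emotion, and head pose occupy disentangled subspaces, so each can be conditioned or overridden independently (e.g., replacing the pose head's output with a user-defined trajectory).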
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3738