Keywords: talking-head animation, autoregressive modeling, visual generation, generative models, multimodality
Abstract: Talking-head animation focuses on generating realistic facial videos from audio input. Following Generative Adversarial Networks (GANs), diffusion models have become the mainstream approach, owing to their robust generative capability. However, inherent limitations of the diffusion process often lead to inter-frame flicker and slow inference, hindering their practical use in talking-head animation. To address this, we introduce AvatarSync, an autoregressive framework built on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven by text or audio input. To mitigate flicker and ensure continuity, AvatarSync leverages an autoregressive pipeline that strengthens temporal modeling. In addition, to ensure controllability, we introduce phonemes, the basic units of speech sound, and construct a many-to-one mapping from text/audio to phonemes, enabling precise phoneme-to-visual alignment. To further accelerate inference, we adopt a two-stage generation strategy that decouples semantic modeling from visual dynamics, incorporating a Phoneme-Frame Causal Attention Mask and a timestamp-aware adaptive strategy to support parallel inference. Extensive experiments on Chinese (CMLR) and English (HDTF) benchmarks show that AvatarSync substantially reduces inter-frame flicker and outperforms existing methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable solution for practical talking-head animation.
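As a rough illustration of the phoneme-frame causal masking the abstract refers to, the sketch below builds a boolean attention mask over a concatenated sequence of phoneme tokens followed by frame tokens, where phonemes attend causally to earlier phonemes and each frame attends to its aligned phoneme prefix and to earlier frames. This is only a minimal PyTorch sketch under assumed conventions; the function name, the per-frame phoneme alignment, and the block layout are hypothetical and not the paper's specification.

```python
import torch


def phoneme_frame_causal_mask(num_phonemes: int, num_frames: int,
                              frame_to_phoneme: torch.Tensor) -> torch.Tensor:
    """Boolean mask over [phoneme tokens; frame tokens]; True = attention allowed.

    Hypothetical layout: phonemes use standard causal attention; frame f
    attends to phonemes up to its aligned index and to frames <= f.
    """
    total = num_phonemes + num_frames
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Phoneme block: lower-triangular (causal) attention among phonemes.
    mask[:num_phonemes, :num_phonemes] = torch.tril(
        torch.ones(num_phonemes, num_phonemes, dtype=torch.bool))

    for f in range(num_frames):
        row = num_phonemes + f
        # Frame f sees phonemes up to (and including) its aligned phoneme index.
        mask[row, : frame_to_phoneme[f] + 1] = True
        # Frame f sees itself and earlier frames (causal over frames).
        mask[row, num_phonemes: row + 1] = True
    return mask


# Usage example (assumed alignment): 5 phonemes, 8 frames,
# each frame mapped to the phoneme index it is aligned with.
alignment = torch.tensor([0, 0, 1, 2, 2, 3, 4, 4])
mask = phoneme_frame_causal_mask(5, 8, alignment)
print(mask.shape)  # torch.Size([13, 13])
```

Because every frame's allowed context depends only on precomputed phoneme positions rather than on previously generated frames' content, a mask of this form is compatible with decoding multiple frames in parallel, which is the motivation the abstract gives for the two-stage design.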
Supplementary Material: zip
Primary Area: generative models
Submission Number: 18589