Audio Conditioned Continuous Facial Animation with Identity Preserving Dynamics

Published: 03 Nov 2023 · Last Modified: 06 Nov 2025 · OpenReview Archive Direct Upload · Readers: Everyone · License: CC BY 4.0
Abstract: Synthesizing expressive talking heads from a single reference image and an audio clip demands temporally coherent motion and emotion-consistent facial dynamics. Many prior approaches discretize expression conditions or collapse control to a single scalar, which induces neutral “waypoints,” discontinuities, and brittle transitions. We introduce AffectPilot, an audio-driven any-to-any framework that treats facial behavior as trajectories evolving on a learned continuous expression manifold. From speech, we extract frame-level affect descriptors that capture prosodic and timbral cues; a Mixture-of-Experts policy then selects and blends specialized controllers to produce identity-adaptive, continuously varying expression signals without detours through neutrality. A diffusion-based renderer, coupled with an identity-preserving adaptor, converts these signals into photorealistic frames, while temporal alignment modules enforce cross-modal coherence via rhythm- and phoneme-aware constraints and motion smoothness objectives. Evaluations on CREMA-D demonstrate strong emotional consistency, natural expression transitions, and stable timing, with performance that matches or surpasses recent baselines. Ablations confirm that the manifold-based trajectory planning, the expert mixture, and the diffusion rendering make complementary contributions. The resulting system delivers continuous, adaptive expression control and robust audio–visual synchronization for high-fidelity, any-to-any talking-head generation.
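To make the Mixture-of-Experts expression policy concrete, the following is a minimal sketch, not the authors' implementation, of how a gating network could select and blend specialized controller experts into a continuous, identity-adaptive expression trajectory. All module names, dimensions, and the soft-gating formulation are assumptions for illustration only.

```python
# Hypothetical sketch of an MoE expression policy (assumed design, not AffectPilot's code):
# per-frame affect descriptors + an identity embedding are routed through a softmax gate
# that blends several controller experts into one continuous expression trajectory.
import torch
import torch.nn as nn


class ExpertController(nn.Module):
    """One specialized controller: maps frame-level affect descriptors plus an
    identity embedding to a point in an (assumed) continuous expression space."""

    def __init__(self, affect_dim: int, id_dim: int, expr_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(affect_dim + id_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, expr_dim),
        )

    def forward(self, affect: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        # affect: (B, T, affect_dim); identity: (B, id_dim), broadcast over time.
        idt = identity.unsqueeze(1).expand(-1, affect.size(1), -1)
        return self.net(torch.cat([affect, idt], dim=-1))


class MoEExpressionPolicy(nn.Module):
    """Softmax gate weights the experts per frame and blends their outputs,
    so the expression signal varies continuously without a neutral detour."""

    def __init__(self, affect_dim: int = 64, id_dim: int = 128,
                 expr_dim: int = 32, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            ExpertController(affect_dim, id_dim, expr_dim) for _ in range(num_experts)
        )
        self.gate = nn.Linear(affect_dim + id_dim, num_experts)

    def forward(self, affect: torch.Tensor, identity: torch.Tensor) -> torch.Tensor:
        idt = identity.unsqueeze(1).expand(-1, affect.size(1), -1)
        weights = torch.softmax(self.gate(torch.cat([affect, idt], dim=-1)), dim=-1)  # (B, T, K)
        expert_out = torch.stack(
            [e(affect, identity) for e in self.experts], dim=-1
        )  # (B, T, expr_dim, K)
        # Frame-wise convex blend of expert trajectories.
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)  # (B, T, expr_dim)


if __name__ == "__main__":
    policy = MoEExpressionPolicy()
    affect = torch.randn(2, 100, 64)   # assumed frame-level affect descriptors from audio
    identity = torch.randn(2, 128)     # assumed identity embedding from the reference image
    expr_traj = policy(affect, identity)
    print(expr_traj.shape)             # torch.Size([2, 100, 32])
```

In this sketch the blended trajectory would then condition the diffusion-based renderer; the soft gate is one plausible reading of "selects and blends specialized controllers," and a sparse (top-k) gate would be an equally reasonable alternative.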