Keywords: Talking Head Generation, Disentangled Representation, Real-time Animation
Abstract: Generating photorealistic and expressive talking heads from audio faces a generative trilemma: a forced trade-off among real-time performance, lip-sync accuracy, and emotional fidelity. We propose RETA, an end-to-end framework that resolves this trilemma. At its core is a strategy that disentangles the audio signal into two representations. First, for robust lip sync, a 3D Morphable Model (3DMM) serves as a differentiable bridge, providing strong geometric guidance within the end-to-end model and preventing error accumulation. Second, for nuanced expression, a dynamic emotion embedding is learned from audio in a completely label-free manner, by combining cross-modal knowledge distillation from a visual expert with a novel cross-synthesis consistency loss that keeps the representation identity-agnostic. These two representations are then hierarchically injected into a single-pass GAN generator for disentangled control. RETA establishes a new state of the art, outperforming previous methods across all key metrics while generating high-fidelity video at speeds exceeding 55 FPS. Code will be released upon publication.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5010