Keywords: Talking Head Generation, Disentangled Representation, Real-time Animation
Abstract: Generating photorealistic and expressive talking heads from audio faces a generative trilemma: a forced trade-off among real-time performance, lip-sync accuracy, and emotional fidelity. We propose RETA, an end-to-end framework that resolves this trilemma. At its core is a strategy that disentangles the audio signal into two representations. First, for robust lip sync, a 3D Morphable Model (3DMM) serves as a differentiable bridge, providing strong geometric guidance within the end-to-end model and preventing error accumulation. Second, for nuanced expression, a dynamic emotion embedding is learned from audio in a completely label-free manner, by combining cross-modal knowledge distillation from a visual expert with a novel cross-synthesis consistency loss that keeps the representation identity-agnostic. These two representations are then hierarchically injected into a single-pass GAN generator for disentangled control. RETA establishes a new state of the art, outperforming previous methods across all key metrics while generating high-fidelity video at speeds exceeding 55 FPS. Code will be released upon publication.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 5010