EmoVox: Continuous Affective Generation for Identity-Adaptive Talking Faces from Audio
Abstract: The synthesis of photorealistic talking faces from audio requires nuanced modeling of both identity-specific facial structure and time-varying emotional expressions. Existing methods frequently discretize the expression space or impose artificial neutral transitions, resulting in mechanical-looking animation that lacks natural motion dynamics. We present EmoVox, a neural framework that learns continuous, audio-driven expression trajectories in a high-dimensional latent space of facial dynamics. Our system processes audio input through a hierarchical feature extractor that captures both prosodic and semantic cues, which are then mapped to continuous expression parameters via an adaptive attention mechanism. This approach eliminates the need for neutral-state interpolation and enables smooth, context-aware expression transitions. A key innovation is our identity-preserving neural renderer, which maintains subject-specific facial characteristics while generating emotionally congruent animations. Temporal coherence is enforced through learned synchronization constraints that align visual outputs with audio rhythms and phonetic content. Comprehensive evaluations on the MEAD dataset demonstrate superior performance in emotional authenticity, motion naturalness, and identity preservation compared to state-of-the-art methods. Our ablation studies confirm that the continuous representation learning, the adaptive attention mechanism, and the synchronization modules each contribute significantly to the overall performance. The framework provides an effective solution for generating high-fidelity talking faces with natural emotional dynamics and robust cross-identity generalization capabilities.
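
The following is a minimal, self-contained sketch of the pipeline outlined in the abstract (hierarchical audio encoding, adaptive attention producing continuous expression parameters, and identity-conditioned rendering). It assumes a PyTorch implementation; all module names, layer choices (Conv1d, GRU, MultiheadAttention), and dimensions are illustrative assumptions rather than the authors' actual architecture.

# Illustrative sketch only: module names, layer types, and dimensions are
# assumptions; the real EmoVox architecture is not specified at this level.
import torch
import torch.nn as nn


class HierarchicalAudioEncoder(nn.Module):
    """Extracts frame-level prosodic and sequence-level semantic cues."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        # Low-level (prosodic) branch: small receptive field over mel frames.
        self.prosodic = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Higher-level (semantic) branch: recurrent context over the sequence.
        self.semantic = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, mel):                                        # mel: (B, T, n_mels)
        pros = self.prosodic(mel.transpose(1, 2)).transpose(1, 2)  # (B, T, D)
        sem, _ = self.semantic(pros)                               # (B, T, 2D)
        return pros, self.proj(sem)                                # both (B, T, D)


class AdaptiveExpressionAttention(nn.Module):
    """Maps audio features to continuous expression parameters via attention."""
    def __init__(self, d_model=256, n_heads=4, n_expr=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_expr = nn.Linear(d_model, n_expr)

    def forward(self, prosodic, semantic):
        # Semantic features attend over prosodic features, yielding a
        # per-frame expression trajectory with no neutral-state interpolation.
        ctx, _ = self.attn(query=semantic, key=prosodic, value=prosodic)
        return self.to_expr(ctx)                                   # (B, T, n_expr)


class IdentityPreservingRenderer(nn.Module):
    """Fuses an identity embedding with expression codes into image frames."""
    def __init__(self, n_expr=64, d_id=128, img_size=64):
        super().__init__()
        self.fuse = nn.Linear(n_expr + d_id, 512)
        self.decode = nn.Sequential(
            nn.Linear(512, img_size * img_size * 3),
            nn.Tanh(),
        )
        self.img_size = img_size

    def forward(self, expr, identity):              # expr: (B, T, n_expr), identity: (B, d_id)
        ident = identity.unsqueeze(1).expand(-1, expr.size(1), -1)
        h = torch.relu(self.fuse(torch.cat([expr, ident], dim=-1)))
        frames = self.decode(h)                     # (B, T, 3 * H * W)
        B, T, _ = frames.shape
        return frames.view(B, T, 3, self.img_size, self.img_size)


if __name__ == "__main__":
    mel = torch.randn(2, 100, 80)       # batch of mel-spectrogram sequences
    identity = torch.randn(2, 128)      # per-subject identity embeddings
    encoder = HierarchicalAudioEncoder()
    attention = AdaptiveExpressionAttention()
    renderer = IdentityPreservingRenderer()
    pros, sem = encoder(mel)
    expr = attention(pros, sem)
    video = renderer(expr, identity)
    print(video.shape)                  # torch.Size([2, 100, 3, 64, 64])

In this sketch, the learned synchronization constraints mentioned in the abstract would enter as additional training losses aligning the expression trajectory with audio rhythm and phonetic content; they are omitted here because the abstract does not specify their form.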