Keywords: Video Generation, Audio Driven Avatar Animation
Abstract: Audio-driven human video generation has greatly improved lip synchronization. However, most methods still use audio mainly to control the mouth, so the coupling between speech rhythm and body motion remains weak, which often makes generated characters look unnatural. We present \textbf{ApoAvatar}, a diffusion-based framework that ties speaking style to motion dynamics. We introduce an Audio–Pose Prior Refocusing mechanism, which adjusts pose guidance according to audio intensity: strong accents increase gesture magnitude, while quiet passages suppress unnecessary motion. We also design a frame-wise audio–video interaction module that updates audio features using the current visual context and the refocused pose prior through a dedicated bidirectional cross-attention. This yields better short-term synchronization and motion coherence. The framework supports both pose-controlled and pose-free inference within a single model. Extensive experiments on EMTD and HDTF show clear gains over strong baselines in lip–audio synchronization, gesture expressiveness, and overall motion naturalness.
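The abstract's refocusing idea can be sketched as a simple intensity-dependent gating of pose guidance. The snippet below is a minimal illustration under assumed conventions, not the paper's actual mechanism: the function name `refocus_pose_prior`, the RMS-energy proxy for audio intensity, and the gain range are all hypothetical.

```python
import numpy as np

def refocus_pose_prior(pose_deltas, audio_frames, g_min=0.3, g_max=1.5):
    """Illustrative sketch of audio-intensity pose gating: scale per-frame
    pose guidance by normalized audio energy, so louder frames amplify
    gestures and quiet frames suppress them. All names and the gain
    range [g_min, g_max] are assumptions for illustration only."""
    # Per-frame RMS energy of audio samples aligned to each video frame.
    energy = np.sqrt((audio_frames ** 2).mean(axis=1))
    # Normalize energy to [0, 1]; guard against an all-silent clip.
    span = energy.max() - energy.min()
    norm = (energy - energy.min()) / span if span > 0 else np.zeros_like(energy)
    # Map normalized intensity to a per-frame gain in [g_min, g_max].
    gain = g_min + (g_max - g_min) * norm
    # Scale each frame's pose guidance by its gain.
    return pose_deltas * gain[:, None]
```

In the actual framework this gating would presumably act on learned pose features inside the diffusion model rather than raw pose deltas; the sketch only conveys the accent-amplifies / silence-suppresses behavior described in the abstract.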
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5696