Keywords: Video Generation, Audio Driven Avatar Animation
Abstract: Audio-driven human video generation has greatly improved lip synchronization. However, most methods still use audio mainly to control the mouth, so the coupling between speech rhythm and body motion remains weak, which often makes generated characters look unnatural. We present \textbf{ApoAvatar}, a diffusion-based framework that ties speaking style to motion dynamics. We introduce an Audio–Pose Prior Refocusing mechanism, which adjusts pose guidance according to audio intensity: strong accents increase gesture magnitude, while quiet passages suppress unnecessary motion. We also design a frame-wise audio–video interaction module that updates audio features using the current visual context and the refocused pose prior through a dedicated bidirectional cross-attention. This yields better short-term synchronization and motion coherence. The framework supports both pose-controlled and pose-free inference within a single model. Extensive experiments on EMTD and HDTF show clear gains over strong baselines in lip–audio synchronization, gesture expressiveness, and overall motion naturalness.
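The abstract's refocusing idea can be sketched as a simple intensity-dependent gating of pose guidance. The snippet below is a minimal illustration under assumed conventions, not the paper's actual mechanism: the function name `refocus_pose_prior`, the RMS-energy proxy for audio intensity, and the gain range are all hypothetical.

```python
import numpy as np

def refocus_pose_prior(pose_deltas, audio_frames, g_min=0.3, g_max=1.5):
    """Illustrative sketch of audio-intensity pose gating: scale per-frame
    pose guidance by normalized audio energy, so louder frames amplify
    gestures and quiet frames suppress them. All names and the gain
    range [g_min, g_max] are assumptions for illustration only."""
    # Per-frame RMS energy of audio samples aligned to each video frame.
    energy = np.sqrt((audio_frames ** 2).mean(axis=1))
    # Normalize energy to [0, 1]; guard against an all-silent clip.
    span = energy.max() - energy.min()
    norm = (energy - energy.min()) / span if span > 0 else np.zeros_like(energy)
    # Map normalized intensity to a per-frame gain in [g_min, g_max].
    gain = g_min + (g_max - g_min) * norm
    # Scale each frame's pose guidance by its gain.
    return pose_deltas * gain[:, None]
```

In the actual framework this gating would presumably act on learned pose features inside the diffusion model rather than raw pose deltas; the sketch only conveys the accent-amplifies / silence-suppresses behavior described in the abstract.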
Supplementary Material: zip
Primary Area: generative models
Submission Number: 5696