Identity-Preserving Audio-Driven Holistic Human Motion Video Generation

Published: 01 Jan 2025, Last Modified: 17 Sept 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Generating realistic human motion videos is a pivotal challenge in advancing human-computer interaction. While existing approaches often focus on generating either head or gesture movements from audio, they lack unified control over full-body motion, frequently producing low-resolution and blurred outputs. Additionally, these methods struggle to maintain character identity throughout the generated content. In this paper, we introduce a novel framework that generates photorealistic, personalized human motion videos from audio by decoupling identity features. We integrate both visual features and voice timbre to enhance the preservation of character identity. Our approach follows a four-stage paradigm: (1) frame generation, (2) identity feature customization, (3) audio-motion modeling, and (4) motion-video rendering. Through the collaborative modeling of audio-motion and motion-video stages, our approach effectively maintains the consistency of character identity and background throughout the video, enhancing the realism and coherence of the generated video. Experimental results demonstrate that our framework delivers high-resolution videos with superior fidelity, establishing a new and effective baseline for holistic human motion video generation.
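The four-stage paradigm in the abstract can be sketched as a staged pipeline. This is a minimal illustrative sketch, not the paper's implementation: every name here (`Sample`, `generate_frames`, `customize_identity`, `audio_to_motion`, `render_video`) is a hypothetical placeholder standing in for the corresponding stage.

```python
from dataclasses import dataclass

# Hypothetical sketch of the four-stage paradigm from the abstract.
# All function and class names are illustrative, not the paper's API.

@dataclass
class Sample:
    reference_image: str  # source image of the character
    audio: str            # driving audio track

def generate_frames(sample):
    # Stage 1: frame generation from the reference image.
    return [f"frame({sample.reference_image})"]

def customize_identity(sample):
    # Stage 2: identity feature customization, fusing
    # visual features with voice timbre.
    return {"visual": f"vis({sample.reference_image})",
            "timbre": f"timbre({sample.audio})"}

def audio_to_motion(sample, identity):
    # Stage 3: audio-motion modeling, conditioned on identity.
    return [f"motion({sample.audio}|{identity['timbre']})"]

def render_video(frames, motion, identity):
    # Stage 4: motion-video rendering, keeping character identity
    # and background consistent across frames.
    return [f"render({f}, {m}, {identity['visual']})"
            for f, m in zip(frames, motion)]

def pipeline(sample):
    frames = generate_frames(sample)
    identity = customize_identity(sample)
    motion = audio_to_motion(sample, identity)
    return render_video(frames, motion, identity)

video = pipeline(Sample("ref.png", "speech.wav"))
```

The point of the staging is that identity features extracted in stage 2 condition both the audio-motion and motion-video stages, which is how the framework keeps the character consistent across the whole clip.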