Keywords: Co-speech Gesture Video Generation
Abstract: Co-speech gestures are fundamental to natural human communication, and generating gesture videos from speech plays a crucial role in human-computer interaction. Existing approaches typically rely on a two-stage framework that first generates intermediate pose representations and then synthesizes the final video. While effective, these methods require extensive pose annotations, which often introduce labeling errors, and still struggle with fine-grained details, particularly in hand generation. To address these challenges, we propose a weakly supervised motion learning framework for co-speech gesture video generation that leverages only audio and video data. Our approach consists of three key stages: (1) a motion encoder that learns a generalizable motion representation from video without pose supervision, (2) a dual-tower architecture that aligns audio with the learned motion representation through an invertible feature extractor, and (3) a video diffusion model that refines fine-grained visual details. During sampling, we introduce a hand refinement method based on initial noise optimization, in which learnable noise parameters are optimized via policy gradient to improve hand synthesis. Extensive experiments on our collected dataset demonstrate that our approach outperforms prior methods across multiple metrics, achieving superior motion fidelity, gesture realism, and overall video quality.
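The sampling-time hand refinement described in the abstract optimizes learnable initial-noise parameters with a policy gradient, which makes sense when the reward (hand quality of the decoded video) cannot be backpropagated through the sampler. A minimal REINFORCE-style sketch in PyTorch, assuming hypothetical stand-ins `sample_video` (a pretrained video diffusion sampler) and `hand_reward` (a scalar hand-quality scorer), neither of which is specified in the abstract:

```python
import torch

def refine_initial_noise(sample_video, hand_reward, latent_shape,
                         steps=50, pop=8, lr=0.05, sigma=0.1):
    """Optimize a learnable mean for the initial diffusion noise via REINFORCE.

    `sample_video`, `hand_reward`, and all hyperparameters are illustrative
    placeholders, not the paper's actual components.
    """
    mu = torch.zeros(latent_shape, requires_grad=True)  # learnable noise parameters
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(steps):
        # Sample a population of candidate initial noises z ~ N(mu, sigma^2 I).
        eps = torch.randn(pop, *latent_shape)
        z = mu.detach() + sigma * eps
        with torch.no_grad():
            # Decode each candidate through the (non-differentiable) sampler
            # and score hand quality; each reward is assumed to be a scalar.
            rewards = torch.stack([hand_reward(sample_video(zi)) for zi in z])
        # Normalized advantages act as a simple variance-reducing baseline.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # REINFORCE surrogate: weight log-probs of the sampled noises by their
        # advantages, so gradient ascent on mu increases expected reward.
        log_prob = torch.distributions.Normal(mu, sigma).log_prob(z)
        loss = -(log_prob.flatten(1).sum(dim=1) * adv).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach()  # refined initial noise for the final sampling pass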
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7934