Keywords: Co-speech Gesture Video Generation
Abstract: Co-speech gestures are fundamental to natural human communication, and generating gesture videos from speech plays a crucial role in human-computer interaction. Existing approaches typically rely on a two-stage framework that first generates intermediate pose representations and then synthesizes the final video. While effective, these methods require extensive pose annotations, which often introduce labeling errors, and still struggle with fine-grained details, particularly in hand generation. To address these challenges, we propose a weakly supervised motion learning framework for co-speech gesture video generation that leverages only audio and video data. Our approach consists of three key stages: (1) a motion encoder that learns a generalizable motion representation from video without pose supervision, (2) a dual-tower architecture that aligns audio with the learned motion representation through an invertible feature extractor, and (3) a video diffusion model that refines fine-grained visual details. During sampling, we introduce a hand refinement method based on initial noise optimization, in which learnable noise parameters are optimized via policy gradient to improve hand synthesis. Extensive experiments on our collected dataset demonstrate that our approach outperforms prior methods across multiple metrics, achieving superior motion fidelity, gesture realism, and overall video quality.
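The sampling-time hand refinement described in the abstract optimizes learnable initial-noise parameters with a policy gradient, which makes sense when the reward (hand quality of the decoded video) cannot be backpropagated through the sampler. A minimal REINFORCE-style sketch in PyTorch, assuming hypothetical stand-ins `sample_video` (a pretrained video diffusion sampler) and `hand_reward` (a scalar hand-quality scorer), neither of which is specified in the abstract:

```python
import torch

def refine_initial_noise(sample_video, hand_reward, latent_shape,
                         steps=50, pop=8, lr=0.05, sigma=0.1):
    """Optimize a learnable mean for the initial diffusion noise via REINFORCE.

    `sample_video`, `hand_reward`, and all hyperparameters are illustrative
    placeholders, not the paper's actual components.
    """
    mu = torch.zeros(latent_shape, requires_grad=True)  # learnable noise parameters
    opt = torch.optim.Adam([mu], lr=lr)
    for _ in range(steps):
        # Sample a population of candidate initial noises z ~ N(mu, sigma^2 I).
        eps = torch.randn(pop, *latent_shape)
        z = mu.detach() + sigma * eps
        with torch.no_grad():
            # Decode each candidate through the (non-differentiable) sampler
            # and score hand quality; each reward is assumed to be a scalar.
            rewards = torch.stack([hand_reward(sample_video(zi)) for zi in z])
        # Normalized advantages act as a simple variance-reducing baseline.
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # REINFORCE surrogate: weight log-probs of the sampled noises by their
        # advantages, so gradient ascent on mu increases expected reward.
        log_prob = torch.distributions.Normal(mu, sigma).log_prob(z)
        loss = -(log_prob.flatten(1).sum(dim=1) * adv).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach()  # refined initial noise for the final sampling pass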
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7934