Weakly Supervised Motion Learning for
Co-speech Gesture Video Generation

ICLR 2026 [Submission ID: 7934]

Videos for Ablation Studies

On this page, we present videos from different stages of our framework to illustrate its progressive improvements. Stages 1 and 2 effectively capture motion representations but lack fine-grained visual details. Stage 3 refines these details; however, hand artifacts remain. With the addition of hand refinement, our complete method produces visually coherent and artifact-free videos. Moreover, we observe that as hand quality improves, overall visual quality, including facial detail, also improves. This suggests a strong correlation between hand realism and the overall perceptual quality of generated videos.

Additionally, we present results from Stage 2 without the invertible feature extractor (IFE), demonstrating its necessity in our framework. We also provide results in which generation relies solely on audio (labeled "W/o Motion Information"), highlighting the limitations of audio-only video synthesis. Together, these experiments underscore the importance of motion representation learning to the effectiveness of our method.

[Videos: four examples, each shown side by side as Stage 1, Stage 2, Stage 2 W/o IFE, Stage 3, and Ours]

[Videos: three examples, each shown side by side as W/o Motion Information and Ours]