IMAC: Implicit Motion-Audio Coupling
for Co-Speech Gesture Video Generation

ICLR 2025 [Submission ID: 6198]


Videos for Figures 5 and 6

On this page, we present the videos corresponding to Figures 5 and 6. As shown in the video for Figure 5, our method produces high-quality videos without blurry hands or finger distortion and maintains a consistent background. In contrast, S2G and MYA exhibit inconsistent backgrounds and suffer from blurry hands and distorted fingers. Additionally, MYA often memorizes appearance features during training, so its generated videos replicate the memorized appearance instead of following the reference image, which makes them inconsistent with the reference. More comparison videos are provided on the "More Videos for Comparisons" page.

In the video for Figure 6, the incomplete model versions suffer from low visual quality, backgrounds that are inconsistent with the reference image, distorted hands, extra fingers, and hands that appear detached from the body. Their generated videos are also temporally inconsistent, with severe motion jitter. Additional videos for the ablation studies are available on the "More Videos for Ablation Studies" page.

Please play each video with the audio on to hear the input speech.

Figure 5 (video panels): GT, S2G, MYA, Ours


Figure 6 (video panels): W/o Ref, W/o Motion, W/o First Stage, W/o Slow-Fast, Ours