Weakly Supervised Motion Learning for
Co-speech Gesture Video Generation

ICLR 2026 [Submission ID: 7934]


Videos for Comparisons

On this page, we present video comparisons of our method against S2G, MYA, and EchoMimicV2.

Our approach generates high-fidelity videos with clear hand details, realistic finger articulation, and stable backgrounds. In contrast, S2G, MYA, and EchoMimicV2 struggle with visual consistency, exhibiting background flickering, hand blurring, and noticeable finger distortions. Furthermore, MYA tends to overfit to appearance features seen during training, reproducing memorized attributes rather than adhering to the provided reference image, which leads to identity inconsistencies.

[Nine video comparison groups; each group shows, left to right: First Frame, S2G, MYA, EchoMimicV2, Ours]