Realistic-Gesture: Co-Speech Gesture Video Generation Through Context-aware Gesture Representation

Anonymous ICLR 2025 Submission (#2259)

Abstract

Co-speech gesture generation is crucial for creating lifelike avatars and enhancing human-computer interaction by synchronizing gestures with speech. Despite recent advancements, existing methods often struggle to accurately align gesture motion with speech signals and to achieve pixel-level realism. To address these challenges, we introduce Realistic-Gesture, a framework that advances co-speech gesture video generation through three innovative components: (1) a speech-aware gesture representation that aligns facial and body gestures with speech semantics for fine-grained control; (2) a mask gesture generator that learns to map audio signals to gestures by predicting masked motion tokens, enabling bidirectional, contextually relevant gesture synthesis and editing; and (3) a structure-aware refinement module that employs multi-level differentiable edge connections to link gesture keypoints for detailed video generation. Extensive experiments demonstrate that Realistic-Gesture not only produces highly realistic, speech-aligned gesture videos but also supports long-sequence generation and gesture editing.

Method

Left: Contrastive learning for gesture-speech alignment. We distill the joint, speech-contextualized features into the latent codebook. Right: Speech drives the generation of discrete gesture motion tokens with the Mask Gesture Generator. During training, we randomly mask tokens and train the model to reconstruct them; at inference, we iteratively remask tokens based on their predicted probabilities. Finally, the Residual Gesture Generator predicts the residual quantized tokens conditioned on the base VQ tokens.
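For clarity, the following is a minimal sketch of the iterative remasking inference described above, assuming a MaskGIT-style cosine schedule; `mask_generator`, `speech_features`, and the tensor shapes are illustrative placeholders rather than our exact interfaces.

```python
# A minimal sketch of the iterative remasking inference loop, assuming a
# MaskGIT-style cosine schedule. `mask_generator` and `speech_features` are
# illustrative placeholders, not the exact interfaces used in the paper.
import math
import torch

@torch.no_grad()
def generate_gesture_tokens(mask_generator, speech_features, seq_len,
                            mask_id, num_steps=10, device="cpu"):
    """Iteratively fill in masked base gesture tokens conditioned on speech."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        # Predict a distribution over codebook entries for every position.
        logits = mask_generator(tokens, speech_features)      # (1, T, codebook_size)
        probs = logits.softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)            # both (1, T)

        # Commit predictions at currently masked positions.
        tokens = torch.where(tokens == mask_id, prediction, tokens)

        # Remask the least confident positions; the masked fraction shrinks
        # over steps following a cosine schedule.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_to_remask = int(mask_ratio * seq_len)
        if num_to_remask == 0:
            break
        remask_idx = confidence.topk(num_to_remask, dim=-1, largest=False).indices
        tokens.scatter_(1, remask_idx, mask_id)

    return tokens
```

The resulting base tokens are then passed to the Residual Gesture Generator, which predicts the residual quantized tokens before decoding.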

Rebuttal: Gesture Video Editing

In this example, we regenerate the first 7 seconds with the new audio and keep the last 8 seconds from the original video to produce the edited result.
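As a rough illustration, such editing can be set up at the token level: the span to be changed is remasked and re-synthesized from the new audio, while the remaining tokens are kept as context. The sketch below assumes 8 motion tokens per second purely for illustration; the helper name and token rate are not taken from the paper.

```python
# A sketch of token-level editing for the example above: tokens covering the
# first 7 seconds are remasked and regenerated from the new audio, while the
# tokens for the last 8 seconds are kept from the source video. The helper
# name and the token rate are illustrative assumptions.
import torch

def build_edit_input(source_tokens, mask_id, edit_seconds=7.0, tokens_per_second=8):
    """Mask the token span to be re-synthesized; keep the rest as fixed context."""
    tokens = source_tokens.clone()              # (1, T) base token ids of the source clip
    edit_len = int(edit_seconds * tokens_per_second)
    tokens[:, :edit_len] = mask_id              # first 7 s will be regenerated from new audio
    return tokens                               # last 8 s remain as bidirectional context
```

The partially masked sequence can then be completed with the same iterative remasking procedure, with the kept tokens serving as bidirectional context.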

Rebuttal: Ablation Studies on Contextualized Motion Representation

Relying on RVQ tokenization alone, the generated gestures are only weakly aligned with the speech audio. Incorporating the pretrained audio encoder from the temporal-alignment stage alleviates this problem. Our contextualized distillation further improves temporal matching, producing more natural movements, beat patterns, and facial expressions.
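For reference, a generic form of such a contrastive gesture-speech alignment objective is sketched below; the symmetric InfoNCE formulation and temperature are standard choices used for illustration, not necessarily the exact loss in the paper.

```python
# A generic symmetric InfoNCE objective for gesture-speech alignment, given
# per-clip gesture and speech embeddings. The formulation and temperature are
# standard choices used here for illustration only.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(gesture_feat, speech_feat, temperature=0.07):
    """Pull matched gesture/speech pairs together, push mismatched pairs apart."""
    g = F.normalize(gesture_feat, dim=-1)                 # (B, D)
    s = F.normalize(speech_feat, dim=-1)                  # (B, D)
    logits = g @ s.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```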

Rebuttal: Comparison on BEAT-X

EMAGE exhibits unnatural temporal transitions and jittering gestures. Our method generates gesture motions that are better aligned with the conditioning speech audio. With contextual distillation, the motion patterns become more natural, as shown on the left.

Rebuttal: Video Avatar Animation

We compare our image-warping-based method with AnimateAnyone for video avatar animation. Although AnimateAnyone produces high-quality hand structures, it fails to preserve the identity of the source speaker. It also fails to capture the background motion caused by camera movement within the video, leading to unstable background rendering.

Comparisons

We compare our method with S2G-Diffusion and ANGIE. We exclude the results of MM-Diffusion due to its inability to generate long-sequence videos.

Long Sequence Generation

Our method supports speech-driven video generation longer than 30 seconds, and even up to one minute.
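One way a fixed-window masked generator can be extended to such lengths is to generate tokens chunk by chunk, carrying over the tail of each window as fixed context for the next. The sketch below is an illustrative assumption of this scheme; `generate_window` (a partially-masked variant of the inference loop above), the window length, and the overlap are hypothetical, not necessarily our exact procedure.

```python
# A sketch of chunked long-sequence generation: each window of tokens is filled
# in by the masked-token loop, and the last `overlap` tokens are carried over as
# fixed context for the next window. `generate_window`, the window length, and
# the overlap are hypothetical.
import torch

def generate_long_sequence(generate_window, speech_chunks, window_len,
                           overlap, mask_id, device="cpu"):
    """Generate token windows sequentially, reusing each window's tail as context."""
    outputs, context = [], None
    for speech_feat in speech_chunks:            # one chunk of speech features per window
        init = torch.full((1, window_len), mask_id, dtype=torch.long, device=device)
        if context is not None:
            init[:, :overlap] = context          # keep the previous tail fixed
        tokens = generate_window(init, speech_feat)
        outputs.append(tokens if context is None else tokens[:, overlap:])
        context = tokens[:, -overlap:]           # carry over for the next window
    return torch.cat(outputs, dim=1)
```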

Video Gesture Editing

In this example, we modify the last few seconds of the source video to exhibit different gesture patterns.

Gesture Pattern Transfer-1

We can re-enact different characters with the same audio to present the same gesture patterns.

Gesture Pattern Transfer-2

We can re-enact the same character with the same audio to present different gesture patterns.

BibTeX

@misc{anonymous2025realisticgesture,
  title  = {Realistic-Gesture: Co-Speech Gesture Video Generation Through Context-aware Gesture Representation},
  author = {Anonymous},
  note   = {ICLR 2025 Submission \#2259}
}