EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation

Published: 20 Jul 2024 · Last Modified: 25 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Co-speech gesture generation faces challenges with imbalanced, long-tailed gesture distributions. Recent methods typically address this by employing a Vector Quantized Variational Autoencoder (VQ-VAE) to encode gestures into a codebook and classify codebook indices based on audio or text cues. However, due to the imbalance, codebook index classification tends to be biased toward majority gestures, neglecting semantically rich minority gestures. To address this, this paper proposes Entropy-Guided Co-Speech Gesture Generation (EGGesture). EGGesture leverages an Entropy-Guided VQ-VAE to jointly optimize the distribution of codebook indices and adjust loss weights for codebook index classification, which consists of a) a differentiable approach for entropy computation using Gumbel-Softmax and cosine similarity, enabling online codebook distribution optimization, and b) a strategy that uses the computed codebook entropy to guide the classification loss weighting. These designs enable dynamic refinement of codebook utilization, striking a balance between the quality of the learned gesture representation and the accuracy of the classification phase. Experiments on the Trinity and BEAT datasets demonstrate EGGesture's state-of-the-art performance both qualitatively and quantitatively. The code and video are available.
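To make the two ideas in the abstract concrete, the sketch below shows one plausible PyTorch instantiation: (i) a differentiable estimate of codebook-usage entropy from Gumbel-Softmax soft assignments over cosine similarities, and (ii) using that usage distribution to weight the codebook-index classification loss. All function names, temperatures, and the exact weighting rule are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: entropy-guided codebook usage and loss weighting.
import torch
import torch.nn.functional as F

def soft_code_assignments(z_e, codebook, tau=1.0):
    """Gumbel-Softmax soft assignments of encoder outputs to codebook entries.

    z_e:      (B, D) encoder outputs
    codebook: (K, D) codebook embeddings
    returns:  (B, K) differentiable soft assignments
    """
    # Cosine similarity between each encoding and each codebook entry.
    logits = F.cosine_similarity(z_e.unsqueeze(1), codebook.unsqueeze(0), dim=-1)
    # Gumbel-Softmax keeps the assignment differentiable for online optimization.
    return F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)

def codebook_usage_entropy(assignments, eps=1e-8):
    """Entropy of the batch-averaged codebook usage distribution."""
    usage = assignments.mean(dim=0)                      # (K,)
    entropy = -(usage * (usage + eps).log()).sum()
    return usage, entropy

def entropy_guided_ce(index_logits, target_indices, usage, eps=1e-8):
    """Weighted cross-entropy over codebook indices; rarely used codes get
    larger weights (one plausible form of entropy-guided weighting)."""
    weights = 1.0 / (usage + eps)
    weights = weights / weights.sum() * usage.numel()    # normalize around 1
    return F.cross_entropy(index_logits, target_indices, weight=weights.detach())

if __name__ == "__main__":
    B, D, K = 32, 64, 512
    z_e = torch.randn(B, D)
    codebook = torch.randn(K, D, requires_grad=True)
    assign = soft_code_assignments(z_e, codebook, tau=0.5)
    usage, H = codebook_usage_entropy(assign)
    # Maximizing H (e.g., adding -H to the VQ-VAE loss) pushes toward a more
    # balanced codebook; `usage` then guides the classifier's loss weights.
    logits = torch.randn(B, K)
    targets = torch.randint(0, K, (B,))
    loss_cls = entropy_guided_ce(logits, targets, usage)
    print(float(H), float(loss_cls))
```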
Primary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: EGGesture is research on generating vivid co-speech gestures, a topic that has garnered interest across academia and industry and is challenging because gesture motions follow an imbalanced, long-tailed distribution. We believe our topic is suitable for ACM MM.
Supplementary Material: zip
Submission Number: 3705