Keywords: gesture synthesis, computer animation, neural networks
Abstract: In this paper, we describe the gesture synthesis system we developed for our entry to the GENEA Challenge 2023. One challenge in learning a co-speech gesture model is that multiple gesture motions may be viable for the same speech utterance. A probabilistic model is therefore preferable to a deterministic regression model for handling this one-to-many mapping. Our system uses a vector-quantized variational autoencoder (VQ-VAE) and discrete diffusion as the framework for predicting co-speech gestures. Because gesture motions are produced by sampling discrete gesture tokens through the discrete diffusion process, the method can generate diverse gestures from the same speech input. Based on the user evaluation results, we discuss the strengths and limitations of our system and share the lessons learned while developing and tuning it. The subjective evaluation shows that our method ranks in the middle for human-likeness among all submitted entries. In the speech appropriateness evaluations, our method achieved preference scores of 55.4% for matched agent gestures and 51.1% for matched interlocutor gestures. Overall, we demonstrate the potential of discrete diffusion models for gesture generation.
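To make the described pipeline concrete, below is a minimal sketch of one common form of discrete diffusion over VQ-VAE token sequences: a mask-based (absorbing-state) sampler that starts from fully masked gesture tokens and iteratively commits the most confident predictions, conditioned on speech features. All names, shapes, and hyperparameters (`TokenDenoiser`, codebook size, sequence length, step schedule) are illustrative assumptions, not the authors' implementation; the abstract only states that gestures are sampled as discrete tokens via discrete diffusion.

```python
# Hedged sketch: absorbing-state discrete diffusion over VQ-VAE gesture
# tokens, conditioned on speech. All sizes and module choices are assumed.
import torch
import torch.nn as nn

NUM_TOKENS = 512      # assumed VQ-VAE codebook size
MASK_ID = NUM_TOKENS  # extra "absorbing" (mask) token id
SEQ_LEN = 64          # assumed gesture tokens per clip
NUM_STEPS = 16        # assumed reverse-diffusion steps (64 / 16 = 4 per step)


class TokenDenoiser(nn.Module):
    """Predicts codebook logits for every (possibly masked) gesture token,
    conditioned on per-token speech features. Hypothetical architecture."""

    def __init__(self, speech_dim=128, d_model=256):
        super().__init__()
        self.tok_emb = nn.Embedding(NUM_TOKENS + 1, d_model)  # +1 for MASK_ID
        self.speech_proj = nn.Linear(speech_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, NUM_TOKENS)

    def forward(self, tokens, speech):
        x = self.tok_emb(tokens) + self.speech_proj(speech)
        return self.head(self.encoder(x))  # (B, SEQ_LEN, NUM_TOKENS)


@torch.no_grad()
def sample_gesture_tokens(model, speech):
    """Reverse process: begin fully masked, then unmask a fixed number of
    the most confident positions each step. Sampling token values (rather
    than taking the argmax) is what yields diverse gestures per utterance."""
    batch = speech.shape[0]
    tokens = torch.full((batch, SEQ_LEN), MASK_ID, dtype=torch.long)
    per_step = SEQ_LEN // NUM_STEPS
    for _ in range(NUM_STEPS):
        probs = model(tokens, speech).softmax(dim=-1)
        sampled = torch.distributions.Categorical(probs).sample()   # (B, L)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)  # (B, L)
        # Never re-decide positions that were already committed.
        conf = conf.masked_fill(tokens != MASK_ID, -float("inf"))
        idx = conf.topk(per_step, dim=-1).indices
        tokens.scatter_(1, idx, sampled.gather(1, idx))
    return tokens  # decode to motion with the VQ-VAE decoder


# Usage (untrained weights, so output tokens are arbitrary but valid):
model = TokenDenoiser()
speech = torch.randn(2, SEQ_LEN, 128)  # assumed per-token speech features
gesture_tokens = sample_gesture_tokens(model, speech)
```

Because each call re-samples the masked tokens from the predicted categorical distributions, running the sampler twice on the same `speech` input generally yields different token sequences, which is the one-to-many behavior the abstract motivates.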