Track: Challenge paper
Team Name: DeepMotion
Keywords: gesture synthesis, computer animation, neural network
Abstract: This paper describes the method and evaluation results of our DeepMotion entry to the GENEA Challenge 2022. One difficulty in data-driven gesture synthesis is that multiple gesture motions can be plausible for the same speech utterance. Deterministic regression methods cannot resolve such conflicting samples and tend to produce damped, averaged motions. We propose a two-stage model to address this uncertainty in gesture synthesis. Inspired by recent text-to-image synthesis methods, our gesture synthesis system first uses a VQ-VAE model to extract short gesture units from the training data as codebook vectors. An autoregressive model based on the GPT-2 transformer is then applied to model the probability distribution over the discrete latent space of the VQ-VAE. User evaluation results show that the proposed method produces gesture motions with reasonable human-likeness and gesture appropriateness.
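To make the two-stage idea concrete, the following is a minimal sketch (not the paper's implementation) of the pipeline the abstract describes: a vector-quantization step that maps continuous gesture latents to discrete codebook tokens, followed by an autoregressive prior that samples the next token. The codebook size, latent dimension, and the uniform stand-in for the GPT-2 prior are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64-entry codebook of 8-D latent vectors.
K, D = 64, 8
codebook = rng.normal(size=(K, D))

def quantize(z):
    """Map each continuous latent vector to its nearest codebook entry
    (the VQ step that turns gesture features into discrete tokens)."""
    # Squared Euclidean distance from every latent to every codebook vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

# Toy "encoder output" for a 10-frame gesture clip.
z = rng.normal(size=(10, D))
tokens, z_q = quantize(z)

def sample_next(history):
    """Stage two: an autoregressive prior predicts the next gesture token
    given the history. A uniform distribution stands in here for the
    GPT-2 transformer used in the actual system."""
    probs = np.full(K, 1.0 / K)  # placeholder for transformer logits
    return int(rng.choice(K, p=probs))

next_token = sample_next(tokens)
```

At synthesis time, the sampled token sequence would be mapped back through the VQ-VAE decoder to produce continuous gesture motion; sampling (rather than regressing) is what lets the model commit to one of several plausible gestures instead of averaging them.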