Towards Efficient and Diverse Generative Model for Unconditional Human Motion Synthesis

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0
Abstract: Recent generative methods such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising Diffusion Probabilistic Models (DMs) have revolutionized human motion synthesis and gained significant attention in the field. However, unconditionally generating highly diverse human motions from a given distribution remains challenging. To enhance the diversity of synthesized human motions, previous methods usually train deep neural networks (DNNs) to represent a transport map that transforms a Gaussian noise distribution into the real human motion distribution. According to Figalli's regularity theory, however, the optimal transport map is frequently discontinuous, whereas DNNs can only represent continuous maps. Consequently, the generated human motions tend to concentrate heavily on densely populated regions of the data distribution, resulting in mode collapse or mode mixture. To address these issues, we propose MOOT, an efficient method for unconditional human motion synthesis. First, we use a reconstruction network based on GRU and transformer modules to map human motions to a latent space. Next, we employ convex optimization to compute the Optimal Transport (OT) map from the noise distribution to the latent space distribution of human motions. Then, we combine the extended OT map with the generator of the reconstruction network to generate new human motions, thereby overcoming mode collapse and mode mixture. MOOT generates a latent code distribution that is well-behaved and highly structured, providing a strong motion prior for various applications in the field of human motion. In qualitative and quantitative experiments, MOOT achieves state-of-the-art results, surpassing the latest methods and validating its effectiveness in unconditional human motion generation.
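The convex-optimization step described in the abstract can be illustrated with a small semi-discrete OT sketch. This is not the authors' implementation: it assumes a standard Gaussian noise source, uniform weights on the latent codes, and plain Monte Carlo gradient descent on the heights of a piecewise-linear Brenier potential u(x) = max_i(<x, y_i> + h_i). The names `fit_brenier_heights` and `ot_map` are hypothetical, and the extended OT map and the GRU/transformer generator are omitted.

```python
import numpy as np


def fit_brenier_heights(targets, n_samples=10000, n_iters=500, lr=0.5, seed=0):
    """Estimate heights h of the potential u(x) = max_i(<x, y_i> + h_i).

    Assumptions (not from the paper): Gaussian source noise, uniform target
    weights, fixed step size, Monte Carlo estimates of the cell masses.
    """
    rng = np.random.default_rng(seed)
    n, d = targets.shape
    nu = np.full(n, 1.0 / n)   # desired mass of each target point
    h = np.zeros(n)            # one height per target point
    for _ in range(n_iters):
        x = rng.standard_normal((n_samples, d))
        # Each noise sample falls in the cell of the target whose
        # hyperplane <x, y_i> + h_i is highest at x.
        idx = np.argmax(x @ targets.T + h, axis=1)
        w = np.bincount(idx, minlength=n) / n_samples
        # Gradient of the convex energy is (cell mass - desired mass).
        h -= lr * (w - nu)
        h -= h.mean()          # heights are defined up to a constant
    return h


def ot_map(x, targets, h):
    """Send each noise sample to the target point owning its cell."""
    idx = np.argmax(x @ targets.T + h, axis=1)
    return targets[idx], idx


if __name__ == "__main__":
    # Four hypothetical 2-D "latent codes" in an asymmetric layout.
    targets = np.array([[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.5]])
    h = fit_brenier_heights(targets)
    noise = np.random.default_rng(1).standard_normal((50000, 2))
    _, idx = ot_map(noise, targets, h)
    # After fitting, each code should receive roughly equal noise mass.
    print(np.bincount(idx, minlength=4) / 50000)
```

Because the energy is convex in the heights, the iteration has a unique optimum up to an additive constant. Note that the resulting map is piecewise constant and jumps across cell boundaries, which is the kind of discontinuity a continuous DNN cannot represent.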
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: This work contributes to human motion understanding in multimedia applications in two main ways. Firstly, we combine GRU and transformer modules to encode human motions into a latent space, and compute the Brenier potential to represent the optimal transport map within that latent space. The method efficiently locates and avoids the singularity set defined by Figalli when generating new human motions, eliminating mode collapse and mode mixture. Secondly, we compute the transport map via the Brenier potential, whose optimization can be accelerated using a GPU-based convex optimization algorithm. This optimization ensures convergence to a unique global optimum while providing a bounded error estimate. The proposed method is evaluated in comprehensive experiments on widely used human motion datasets. The results demonstrate the effectiveness of the proposed method over state-of-the-art approaches on the unconditional human motion generation task. Overall, this work advances multimedia research by introducing new techniques and addressing key challenges in human motion generation. It aligns with the core themes and research directions of the ACM MM Conference, making it a valuable addition to the conference proceedings.
Submission Number: 2528