StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Oral · CC BY 4.0
Abstract: Thanks to the powerful generative capacity of diffusion models, recent years have witnessed rapid progress in human motion generation. Existing diffusion-based methods employ disparate network architectures and training strategies, and the effect of each design choice remains unclear. In addition, the iterative denoising process incurs considerable computational overhead, which is prohibitive for real-time scenarios such as virtual characters and humanoid robots. We therefore first conduct a comprehensive investigation into network architectures, training strategies, and inference processes. Based on this analysis, we tailor each component for efficient, high-quality human motion generation. Despite its promising performance, the tailored model still suffers from foot skating, a ubiquitous issue in diffusion-based solutions. To eliminate footskate, we identify foot-ground contact and correct foot motions along the denoising process. By organically combining these well-designed components, we present StableMoFusion, a robust and efficient framework for human motion generation. Extensive experimental results show that StableMoFusion performs favorably against current state-of-the-art methods.
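The footskate correction described in the abstract detects feet that should be in ground contact and pins them while they would otherwise slide. Below is a minimal, illustrative sketch of this idea; the joint indices, thresholds, and function name are assumptions for illustration, not the authors' exact implementation, and in the paper the correction is applied to the motion predicted at each denoising step rather than only as a post-process.

```python
import torch

# Hypothetical joint indices and thresholds (illustrative assumptions):
FOOT_JOINTS = [7, 8, 10, 11]   # e.g., ankles and toes in a 22-joint skeleton
HEIGHT_THRESH = 0.05           # feet below this height (m) count as in contact
VEL_THRESH = 0.01              # max horizontal speed (m/frame) allowed in contact


def clean_footskate(positions: torch.Tensor) -> torch.Tensor:
    """positions: (T, J, 3) joint trajectories (y-up); returns a corrected copy."""
    pos = positions.clone()
    for t in range(1, pos.shape[0]):
        for j in FOOT_JOINTS:
            height = pos[t, j, 1]
            # Horizontal displacement relative to the previous frame.
            vel = (pos[t, j, [0, 2]] - pos[t - 1, j, [0, 2]]).norm()
            # A foot that is near the ground yet still sliding is pinned
            # to its previous horizontal position.
            if height < HEIGHT_THRESH and vel > VEL_THRESH:
                pos[t, j, [0, 2]] = pos[t - 1, j, [0, 2]]
    return pos
```

In a diffusion pipeline, such a cleanup would be invoked on the clean-motion estimate at selected denoising steps, so that subsequent steps smooth the corrected trajectories back into a coherent motion.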
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Generation] Multimedia Foundation Models
Relevance To Conference: Our work fills existing research gaps in motion generation and enhances the effectiveness and reliability of diffusion-based motion generation in practical applications. Human motion generation aims to produce natural, realistic, and diverse human motions driven by multi-modal conditional information (e.g., text and audio), which can be used to animate virtual characters or manipulate humanoid robots. Our exploration is specifically directed at text-conditional motion generation: we present a robust and efficient framework that enhances the effectiveness and reliability of text-conditional diffusion-based motion generation. Extensive experiments demonstrate that our framework achieves an excellent trade-off between text-motion consistency and motion quality compared with other state-of-the-art methods, offering valuable insights for researchers and practitioners in the field and guiding future text-to-motion developments and applications. Furthermore, we propose an effective solution within the diffusion process to the footskate problem that often occurs during the mapping from the textual to the kinematic modality.
Supplementary Material: zip
Submission Number: 5362