Uniform Text-Motion Generation and Editing via Diffusion Model

Published: 10 Oct 2024 · Last Modified: 19 Nov 2024 · AFM 2024 Poster · CC BY 4.0
Keywords: Unified Framework, Contrastive Learning and Latent Space, Multimodal Control and Editing
Abstract: Diffusion models excel at controllable generation for continuous modalities, making them well suited to motion generation. However, existing diffusion-based approaches are limited in flexibility: they focus solely on text-to-motion generation and lack motion editing capabilities. To address these issues, we introduce UniTMGE, a uniform text-motion generation and editing framework based on diffusion. UniTMGE overcomes single-modality limitations, performing efficiently and effectively across multiple tasks such as text-driven motion generation, motion captioning, motion completion, and multimodal motion editing. UniTMGE comprises three components: CTMV, which maps text and motion into a shared latent space via contrastive learning; a controllable diffusion model tailored to the CTMV space; and MCRE, which unifies multimodal conditions into CLIP representations, enabling precise multimodal control and flexible motion editing through simple linear operations. We conduct both closed-world and open-world experiments on the Motion-X dataset with detailed text descriptions, and the results demonstrate our model's effectiveness and generalizability across multiple tasks.
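
The abstract describes CTMV as aligning text and motion in a shared latent space via contrastive learning, with editing performed through simple linear operations in that space. The following is a minimal sketch of how such a CLIP-style symmetric contrastive alignment and linear edit might look; the function names, dimensions, temperature, and edit formula are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: symmetric contrastive alignment of paired text/motion
# embeddings, in the spirit of CTMV as described in the abstract.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (text, motion) embeddings."""
    # Normalize so that dot products become cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)

    # Pairwise similarity logits, scaled by a temperature.
    logits = text_emb @ motion_emb.t() / temperature

    # Matching (text, motion) pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2m = F.cross_entropy(logits, targets)
    loss_m2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2m + loss_m2t)

# Illustrative linear edit in the shared space (assumed form): shift a motion
# latent toward a target text description and away from a source description.
def linear_edit(z_motion, z_text_source, z_text_target, alpha=1.0):
    return z_motion + alpha * (z_text_target - z_text_source)
```

A usage note under the same assumptions: given batched embeddings of shape (B, D) produced by separate text and motion encoders, the loss above would be minimized jointly with the encoders, and the edited latent from `linear_edit` would then be decoded by the diffusion model operating in the shared space.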
Submission Number: 33