COME: Advancing Representation Learning and Generative Modeling for High-Quality Text-to-Motion Generation
Keywords: Text-to-Motion Generation, Human Motion Generation, Diffusion Model
TL;DR: A Continuous Motion Diffusion Model with Advanced Representation Learning and Generative Modeling for High-Quality Text-to-Motion Generation
Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic 3D human motion from natural language descriptions. Although continuous diffusion models naturally align with the temporal and spatial continuity of motion, they have underperformed discrete token-based approaches in generation quality. However, as T2M tasks evolve to include motion editing, personalization, and multimodal control, they increasingly demand fine-grained semantics, compositionality, and diverse sampling, all of which are better supported by continuous frameworks. Motivated by these real-world demands and the inherent continuity of motion, we revisit continuous diffusion modeling and identify two core limitations: (1) motion representations are often crowded and poorly separable, which increases the difficulty of generation and denoising; and (2) the generative modeling itself is suboptimal, which further degrades generation quality. To address these challenges, we propose COME, a continuous diffusion framework that enhances both motion representation and generative modeling. COME comprises two main components: the Motion Contrastive Masked Autoencoder (MoCMAE) and the Cross-Condition Diffusion Transformer (ccDIT).
MoCMAE employs an asymmetric hybrid architecture that integrates Masked Motion Modeling to extract key spatio-temporal features and Contrastive Learning to further enhance feature discriminability, thereby providing an expressive latent space. Meanwhile, ccDIT stacks cross-condition blocks for global and fine-grained semantic comprehension and applies Stable-Min-SNR-$\gamma$ to address training-inference inconsistencies and loss conflicts across timesteps, thus boosting generation quality. Extensive experiments show that COME achieves state-of-the-art performance while improving both training and inference efficiency, highlighting the effectiveness of our approach.
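To make the two training-time ingredients named above concrete, the minimal PyTorch sketch below illustrates the generic forms they build on: a hybrid objective combining masked-frame reconstruction with an InfoNCE contrastive term (the general pattern behind MoCMAE's masked modeling plus contrastive learning), and the standard Min-SNR-$\gamma$ timestep weighting that the paper's Stable-Min-SNR-$\gamma$ variant presumably extends. All function names, the hyperparameters `gamma`, `tau`, and `lam`, and the way the losses are combined are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def min_snr_gamma_weights(alphas_cumprod, t, gamma=5.0):
    """Standard Min-SNR-gamma weights for epsilon-prediction diffusion losses.

    SNR(t) = alpha_bar_t / (1 - alpha_bar_t); clipping the weight at
    min(SNR, gamma) / SNR keeps low-noise timesteps from dominating
    training. (Sketch only; the paper's "Stable" variant is not public.)
    """
    snr = alphas_cumprod[t] / (1.0 - alphas_cumprod[t])
    return torch.clamp(snr, max=gamma) / snr

def hybrid_mocmae_style_loss(recon, target, mask, z_a, z_b, tau=0.07, lam=0.5):
    """Masked reconstruction + InfoNCE contrastive term (hypothetical form).

    recon/target: (B, T, D) motion sequences; mask: (B, T), 1 at masked
    frames; z_a/z_b: (B, d) embeddings of two views of the same clip.
    """
    # Masked motion modeling: reconstruct only the masked frames.
    per_frame = ((recon - target) ** 2).mean(dim=-1)
    rec_loss = (per_frame * mask).sum() / mask.sum().clamp(min=1)

    # InfoNCE: matching views are positives, rest of the batch negatives.
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau
    labels = torch.arange(z_a.size(0), device=z_a.device)
    ctr_loss = F.cross_entropy(logits, labels)

    return rec_loss + lam * ctr_loss
```

In a training loop, the weights from `min_snr_gamma_weights` would multiply the per-sample diffusion loss before averaging, while the hybrid loss would be used to pretrain the autoencoder that provides the latent space for the diffusion model.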
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11989