Keywords: Motion-Language Model; Discrete Diffusion Model; Mask Modeling; Residual Vector Quantization
Abstract: We present MotionDDM, a diffusion-LLM framework for bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode it sequentially, MotionDDM performs multi-step parallel denoising, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally exposes a quality-latency trade-off at inference time: fewer denoising steps yield faster but coarser outputs. On HumanML3D, our method achieves T2M and M2T results competitive with strong baselines. We further adopt Residual Vector Quantization (RVQ) as the motion tokenizer to improve quantization fidelity, and apply Group Relative Policy Optimization (GRPO) within the framework to enhance alignment and controllability. To the best of our knowledge, this is the first work to bring diffusion-LLMs to bidirectional text-motion modeling.
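To make the decoding paradigm concrete, here is a minimal sketch of confidence-based iterative unmasking, the standard parallel-denoising scheme for masked discrete diffusion. The abstract does not specify MotionDDM's decoder, so `model`, `mask_id`, and the linear unmasking schedule below are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of parallel denoising via iterative unmasking.
# `model` maps a token sequence to per-position logits; `mask_id`
# and the schedule are assumptions for illustration only.
import torch

@torch.no_grad()
def parallel_denoise(model, seq_len, mask_id, num_steps=8):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens)                        # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)       # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0)  # only fill masked slots
        # Commit the most confident predictions; remask the rest for later steps.
        k = max(1, int(still_masked.sum()) // (num_steps - step))
        idx = conf.topk(k, dim=-1).indices[0]
        tokens[0, idx] = pred[0, idx]
    return tokens
```

With this kind of schedule, `num_steps=1` degenerates to one-shot parallel decoding, while larger values trade latency for quality, which is the inference-time trade-off the abstract claims. Likewise, a minimal sketch of the residual vector quantization idea behind the RVQ tokenizer is shown below; codebook shapes and the nearest-neighbour assignment rule are assumptions, not the paper's tokenizer.

```python
# Minimal sketch of residual vector quantization: each codebook level
# quantizes the residual left by the previous level, so added depth
# improves reconstruction fidelity. Shapes here are illustrative.
import torch

def rvq_encode(x, codebooks):
    """x: (N, D) motion features; codebooks: list of (K, D) tensors."""
    residual, quantized, codes = x, torch.zeros_like(x), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(-1)   # nearest codeword per row
        q = cb[idx]                                  # (N, D) quantized residual
        codes.append(idx)
        quantized = quantized + q
        residual = residual - q                      # remaining error goes deeper
    return codes, quantized
```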
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3034