Keywords: Motion-Language Model; Discrete Diffusion Model; Mask Modeling; Residual Vector Quantization
Abstract: We present MotionDDM, a diffusion-LLM framework for bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode it sequentially, MotionDDM performs multi-step parallel denoising, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally exposes a quality-latency trade-off at inference time: fewer denoising steps yield faster but coarser outputs. On HumanML3D, our method achieves T2M and M2T results competitive with strong baselines. We further adopt Residual Vector Quantization (RVQ) as the motion tokenizer to improve quantization fidelity, and apply Group Relative Policy Optimization (GRPO) within the framework to enhance alignment and controllability. To the best of our knowledge, this is the first work to bring diffusion-LLMs to bidirectional text-motion modeling.
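To make the decoding paradigm concrete, here is a minimal sketch of confidence-based iterative unmasking, the standard parallel-denoising scheme for masked discrete diffusion. The abstract does not specify MotionDDM's decoder, so `model`, `mask_id`, and the linear unmasking schedule below are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of parallel denoising via iterative unmasking.
# `model` maps a token sequence to per-position logits; `mask_id`
# and the schedule are assumptions for illustration only.
import torch

@torch.no_grad()
def parallel_denoise(model, seq_len, mask_id, num_steps=8):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            break
        logits = model(tokens)                        # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)       # per-position confidence
        conf = conf.masked_fill(~still_masked, -1.0)  # only fill masked slots
        # Commit the most confident predictions; remask the rest for later steps.
        k = max(1, int(still_masked.sum()) // (num_steps - step))
        idx = conf.topk(k, dim=-1).indices[0]
        tokens[0, idx] = pred[0, idx]
    return tokens
```

With this kind of schedule, `num_steps=1` degenerates to one-shot parallel decoding, while larger values trade latency for quality, which is the inference-time trade-off the abstract claims. Likewise, a minimal sketch of the residual vector quantization idea behind the RVQ tokenizer is shown below; codebook shapes and the nearest-neighbour assignment rule are assumptions, not the paper's tokenizer.

```python
# Minimal sketch of residual vector quantization: each codebook level
# quantizes the residual left by the previous level, so added depth
# improves reconstruction fidelity. Shapes here are illustrative.
import torch

def rvq_encode(x, codebooks):
    """x: (N, D) motion features; codebooks: list of (K, D) tensors."""
    residual, quantized, codes = x, torch.zeros_like(x), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(-1)   # nearest codeword per row
        q = cb[idx]                                  # (N, D) quantized residual
        codes.append(idx)
        quantized = quantized + q
        residual = residual - q                      # remaining error goes deeper
    return codes, quantized
```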
Supplementary Material: zip
Primary Area: generative models
Submission Number: 3034