Abstract: We have recently seen tremendous progress in realistic text-to-motion generation. Yet, existing methods often fail or produce implausible motions with unseen text inputs, which limits their applications. In this paper, we present OMG, a novel framework that enables compelling motion generation from zero-shot, open-vocabulary text prompts. Our key idea is to carefully tailor the pretrain-then-finetune paradigm to text-to-motion generation. At the pre-training stage, our model improves its generation ability by learning rich out-of-domain inherent motion traits. To this end, we scale a large unconditional diffusion model up to 1B parameters, so as to utilize massive unlabeled motion data of over 20M motion instances. At the subsequent fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information through a trainable copy of the pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block. The MoC block adaptively recognizes various ranges of the sub-motions with a cross-attention mechanism and processes them separately with text-token-specific experts. Such a design effectively aligns the CLIP token embeddings of text prompts to various ranges of compact and expressive motion features. Extensive experiments demonstrate that OMG achieves significant improvements over state-of-the-art methods on zero-shot text-to-motion generation. Project page: https://tr3e.github.io/omg-page.
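To make the described MoC mechanism more concrete, below is a minimal PyTorch sketch under our own assumptions: the module name MoCBlock, per-frame routing by the argmax of cross-attention weights, and one expert per CLIP token slot are all illustrative choices, not the paper's exact design.

```python
# Hypothetical sketch of a Mixture-of-Controllers (MoC) block.
# Shapes, routing rule, and expert structure are illustrative assumptions.
import torch
import torch.nn as nn


class MoCBlock(nn.Module):
    def __init__(self, motion_dim: int, text_dim: int, num_experts: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention: motion frames attend to CLIP text token embeddings.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=motion_dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True)
        # One lightweight expert per text-token slot (text-token-specific experts).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(motion_dim, motion_dim),
                          nn.SiLU(),
                          nn.Linear(motion_dim, motion_dim))
            for _ in range(num_experts))

    def forward(self, motion_feats: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # motion_feats: (B, T, motion_dim)  per-frame motion features
        # text_tokens:  (B, N, text_dim)    CLIP token embeddings, N == num_experts
        attended, attn_weights = self.cross_attn(
            motion_feats, text_tokens, text_tokens,
            average_attn_weights=True)        # attn_weights: (B, T, N)
        # Route each frame to the expert of the text token it attends to most,
        # so each expert controls the sub-motion range aligned with its token.
        assignment = attn_weights.argmax(dim=-1)            # (B, T)
        out = torch.zeros_like(attended)
        for i, expert in enumerate(self.experts):
            mask = (assignment == i).unsqueeze(-1).to(attended.dtype)  # (B, T, 1)
            out = out + expert(attended) * mask
        return motion_feats + out                            # residual control signal
```

Under these assumptions, the cross-attention weights both align motion frames with text tokens and decide which expert processes each sub-motion range; the actual OMG implementation may differ in routing, normalization, and expert design.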