Human Motion Diffusion as a Generative Prior

Anonymous Authors
Submission #2385
Our DiffusionBlending approach enables fine-grained control over human motion (see more details below).


We introduce three novel motion composition methods, all based on the recent Motion Diffusion Model (MDM). Sequential composition generates arbitrarily long motions with text control over each time interval. Parallel composition generates two-person motion from text. Model composition achieves accurate and flexible control by blending models with different control signals.

DoubleTake - Long Sequence Generation

DoubleTake

Our DoubleTake method (above) enables the efficient generation of long motion sequences in a zero-shot manner. Using it, we demonstrate fluent 10-minute-long motions generated by a model that was trained only on ~10-second-long sequences. In addition, instead of a global textual condition, DoubleTake controls each motion interval with a different text condition while maintaining realistic transitions between intervals. This result is fairly surprising considering that such transitions were not explicitly annotated in the training data. DoubleTake consists of two takes: in the first, each interval is generated conditioned on its text prompt while being aware of the context of its neighboring intervals, with all intervals generated simultaneously in a single batch. The second take then exploits the denoising process to refine the transitions so they better match the surrounding intervals.
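To make the two takes concrete, below is a minimal, hypothetical sketch of the batching and blending logic using a dummy denoiser in place of MDM. The handshake length, the linear blending weights, and the `denoise` function are illustrative assumptions rather than the exact implementation.

```python
import torch

# Toy stand-in for the MDM denoiser: maps a noisy motion batch [B, T, D]
# plus a per-interval text embedding to a cleaner estimate. A real model
# would be a conditioned transformer; this placeholder is an assumption.
def denoise(x_t, text_emb, t):
    return x_t * 0.95 + 0.05 * text_emb.unsqueeze(1)  # dummy update

B, T, D = 3, 120, 263          # 3 intervals, 120 frames each, MDM-style features
H = 20                          # assumed handshake (transition) length in frames
steps = 50

text_emb = torch.randn(B, D)    # one (dummy) text embedding per interval
x = torch.randn(B, T, D)        # start from noise, all intervals in a single batch

# --- First take: denoise all intervals together while sharing handshakes.
# After every step, the last H frames of interval i and the first H frames of
# interval i+1 are linearly blended, so neighboring intervals stay aware of
# each other's context.
w = torch.linspace(0, 1, H).view(1, H, 1)
for t in reversed(range(steps)):
    x = denoise(x, text_emb, t)
    shared = (1 - w) * x[:-1, -H:] + w * x[1:, :H]
    x[:-1, -H:] = shared
    x[1:, :H] = shared

# --- Second take: refine only the transition regions. Interval interiors are
# kept fixed (inpainting-style) while the handshake frames are re-denoised so
# they connect the two texts more smoothly.
mask = torch.zeros(B, T, 1)
mask[:-1, -H:] = 1.0
mask[1:, :H] = 1.0              # 1 = transition frames to refine, 0 = keep fixed
x_ref = x.clone()
x = torch.where(mask.bool(), torch.randn_like(x), x_ref)
for t in reversed(range(steps)):
    x = denoise(x, text_emb, t)
    x = mask * x + (1 - mask) * x_ref   # overwrite non-transition frames each step

# Stitch intervals into one long sequence, counting shared handshakes once.
long_motion = torch.cat([x[0]] + [x[i][H:] for i in range(1, B)], dim=0)
print(long_motion.shape)
```

Note that the same pretrained denoiser is reused for both takes, which is what keeps the method zero-shot.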

The following long motion was generated with DoubleTake in a single diffusion batch. Orange frames are the textually controlled intervals, and the blue/purple frames are the transitions between them.

DoubleTake - Long Motion

Lighter frames represent transitions between intervals.

DoubleTake - Results

Lighter frames represent transitions between intervals.

DoubleTake vs. TEACH model

The following are side-by-side views of our DoubleTake approach compared to TEACH [Athanasiou et al. 2022], which was trained specifically for this task. Both receive the same texts and sequence lengths to generate.

ComMDM - Two-Person Motion Generation

For the few-shot setting, we enable textually driven two-person motion generation for the first time. We exploit MDM as a motion prior for learning two-person motion generation from as few as a dozen training examples. We observe that in order to learn human interactions, we only need to let two fixed prior models communicate with each other through the diffusion process. Hence, we learn a slim communication block, ComMDM, that passes a communication signal between the two frozen priors through the transformer's intermediate activation maps.
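The sketch below is a minimal, assumed illustration of this idea: two frozen copies of a toy transformer prior run in parallel, and a single trainable linear layer (standing in for the ComMDM block) injects each character's mid-layer activations into the other. The layer index, block architecture, and two-pass scheme are simplifications, not the exact design.

```python
import torch
import torch.nn as nn

class ToyMotionPrior(nn.Module):
    """Stand-in for a frozen MDM transformer; kept frozen when training ComMDM."""
    def __init__(self, d=256, layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(layers)
        )

    def forward(self, x, comm_block=None, other_hidden=None, comm_layer=2):
        hidden_at_comm = None
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == comm_layer:
                hidden_at_comm = x
                if comm_block is not None and other_hidden is not None:
                    # Inject the partner's activations through the slim block.
                    x = x + comm_block(other_hidden)
        return x, hidden_at_comm

# The only trainable part: a slim block mapping one character's activations
# into a correction for the other (a single linear layer here -- an assumed
# minimal form, not the paper's exact architecture).
d = 256
comm = nn.Linear(d, d)

prior_a = ToyMotionPrior(d).eval().requires_grad_(False)
prior_b = ToyMotionPrior(d).eval().requires_grad_(False)

x_a = torch.randn(1, 60, d)   # noisy motion tokens for character A
x_b = torch.randn(1, 60, d)   # noisy motion tokens for character B

# First pass without communication to collect each prior's mid-layer
# activations, then a second pass where each prior reads the other's
# activations via the communication block.
with torch.no_grad():
    _, h_a = prior_a(x_a)
    _, h_b = prior_b(x_b)
out_a, _ = prior_a(x_a, comm_block=comm, other_hidden=h_b)
out_b, _ = prior_b(x_b, comm_block=comm, other_hidden=h_a)
print(out_a.shape, out_b.shape)
```

Because both priors are frozen, gradients only reach the slim communication block, which is why very few two-person examples suffice for training.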

ComMDM

Two-person - Text-to-Motion Generation

The following are text-to-motion generations by our ComMDM model. The texts are unseen by the model, but the interactions are largely limited to those seen during training. Each color denotes a different character; both characters are generated simultaneously.

Two-person - Prefix Completions

The following are side-by-side views of our ComMDM approach compared to MRT [Wang et al. 2021], which was trained specifically for this task. Both receive the same motion prefixes to be completed.

Blue is the input prefix, and orange/red are the completions generated by each model.

Fine-Tuned Motion Control

We observe that the motion inpainting process suggested by MDM [Tevet et al. 2022] does not extend well to more elaborate yet important motion tasks such as trajectory and end-effector tracking. We show that fine-tuning the prior for these tasks yields semantic and accurate control using even a single end-effector. We further introduce the DiffusionBlending technique, which generalizes classifier-free guidance to blend between different fine-tuned models and create any cross-combination of keypoint controls over the generated motion. This enables surgical control over human motion, a key capability for any animation system.
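As an illustration of the blending idea, the sketch below applies a classifier-free-guidance-style extrapolation between two dummy denoisers standing in for models fine-tuned on different control signals. The specific combination formula, guidance scale, and simplified sampling loop are assumptions for exposition.

```python
import torch

# Two hypothetical denoisers fine-tuned for different control signals, e.g.
# one conditioned on the root trajectory and one on an end-effector.
# They are dummy functions standing in for the fine-tuned MDM copies.
def denoiser_traj(x_t, t):
    return x_t * 0.9

def denoiser_wrist(x_t, t):
    return x_t * 0.8

def blended_prediction(x_t, t, w=2.0):
    """Classifier-free-guidance-style blending of two fine-tuned models.

    The guidance scale `w` and the exact combination rule are assumptions;
    extrapolating between the two predictions yields a motion that honors
    a cross-combination of the two control signals.
    """
    pred_a = denoiser_traj(x_t, t)
    pred_b = denoiser_wrist(x_t, t)
    return pred_a + w * (pred_b - pred_a)

x_t = torch.randn(1, 120, 263)    # one noisy motion sample, MDM-style features
for t in reversed(range(50)):     # simplified sampling loop, no noise re-injection
    x_t = blended_prediction(x_t, t)
print(x_t.shape)
```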

The following are side-by-side comparisons of our fine-tuned MDM and DiffusionBlending models (marked with a + sign) against MDM motion inpainting.

Trajectory + Text condition