Here, we present side-by-side comparisons of the joint-level keypoints generated by our model and their conversion to SMPL. It can be seen that our model outputs smooth and fluid motions; the observed sudden movements and lack of fluidity arise during the SMPL conversion, since the conversion code processes each frame independently.
We showcase InterMask's capability to generate longer interaction sequences of 10 seconds.
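To illustrate the source of this jitter, below is a minimal, hypothetical sketch of frame-independent joint-to-SMPL fitting; the `smpl_joints` interface, pose parameterization, and optimization settings are assumptions for illustration and do not reproduce the actual conversion code.

```python
import torch

def fit_smpl_per_frame(joints_seq, smpl_joints, n_iters=100, lr=0.01):
    """joints_seq: (T, 22, 3) predicted joint positions.
    smpl_joints: hypothetical callable mapping SMPL pose parameters to joints."""
    poses = []
    for target in joints_seq:                           # each frame fitted independently
        pose = torch.zeros(72, requires_grad=True)      # axis-angle SMPL pose (assumed layout)
        optim = torch.optim.Adam([pose], lr=lr)
        for _ in range(n_iters):
            loss = (smpl_joints(pose) - target).pow(2).mean()  # joint fit only, no temporal term
            optim.zero_grad(); loss.backward(); optim.step()
        poses.append(pose.detach())
    # no coupling between consecutive frames, so small per-frame fitting differences
    # can appear as sudden movements in the rendered SMPL mesh
    return torch.stack(poses)
```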
Here, we present side-by-side comparisons of generated results from the ablation study on the Inter-M Transformer. These comparisons demonstrate the specific contribution of each attention mechanism in different interaction scenarios, such as boxing, synchronized dancing, and sneaking up. The spatio-temporal attention module is crucial for handling complex poses and spatial awareness, the cross-attention mechanism ensures accurate and temporally synchronized reactions, and the self-attention module refines the overall quality.
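For intuition only, here is a highly simplified sketch of how these three attention types could be composed within one transformer block; the token layout, sub-layer ordering, and module names are assumptions and do not reproduce the exact Inter-M Transformer implementation.

```python
import torch
import torch.nn as nn

class InterMBlockSketch(nn.Module):
    """Illustrative block combining spatio-temporal, cross-, and self-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial  = nn.MultiheadAttention(dim, heads, batch_first=True)  # across joints
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)  # across frames
        self.cross    = nn.MultiheadAttention(dim, heads, batch_first=True)  # across persons
        self.selfattn = nn.MultiheadAttention(dim, heads, batch_first=True)  # over all own tokens
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x_a, x_b):
        # x_a, x_b: (B, T, J, dim) token grids for the two persons
        B, T, J, D = x_a.shape

        # spatio-temporal attention: joints within each frame, then frames per joint
        s = self.norms[0](x_a).reshape(B * T, J, D)
        x_a = x_a + self.spatial(s, s, s)[0].reshape(B, T, J, D)
        t = self.norms[1](x_a).permute(0, 2, 1, 3).reshape(B * J, T, D)
        x_a = x_a + self.temporal(t, t, t)[0].reshape(B, J, T, D).permute(0, 2, 1, 3)

        # cross-attention: condition person a on person b's tokens
        q = self.norms[2](x_a).reshape(B, T * J, D)
        kv = self.norms[2](x_b).reshape(B, T * J, D)
        x_a = x_a + self.cross(q, kv, kv)[0].reshape(B, T, J, D)

        # self-attention over all of person a's tokens to refine the result
        h = self.norms[3](x_a).reshape(B, T * J, D)
        x_a = x_a + self.selfattn(h, h, h)[0].reshape(B, T, J, D)
        return x_a
```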
We showcase InterMask's capability to generate interactions for more complex or less structured (in-the-wild) instructions.
For complex instructions involving multiple steps in progression and alternating actions between two individuals, our model performs well, as demonstrated in the first two examples.
However, for more out-of-distribution texts that the model did not encounter during training, it interprets the available cues as best it can to generate plausible interactions. For instance, while the model does not fully understand "pointing a gun," it generates a sample where one person points at the other, who raises their hands. Similarly, in the "Goku vs. Vegeta" scenario, the model understands the context of a fight, producing karate-like poses but not specific moves like the "kamehameha." For prompts like the "Fortnite Orange Justice dance," it generates a celebratory dance with two winners but does not replicate the specific moves.
These limitations highlight the need for future work, which could incorporate foundation models of language, motion, or multimodal representations, utilize additional single-person motion data, or expand interaction datasets, potentially sourced from internet videos.
InterMask can generate high-quality 3D human interactions across diverse text inputs. Here, we show 15 distinct examples of generated interactions including everyday actions, dancing and fighting.
InterMask follows specific details in more nuanced text descriptions, such as the number of steps and relative body directions.
Our InterMask also maintains a certain level of diversity during generation. For each example below, we show two distinct samples generated from the same text description, side by side.
We compare InterMask against a strong diffusion model baseline approach, InterGen.
We showcase InterMask's capability to perform the reaction generation task, where the motion of one individual is generated conditioned on the provided reference motion of the other, with and without text descriptions. The reference motion is shown in pink, and the generated motion is shown in blue.
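As a rough illustration of how such conditional generation could be set up in a token-based masked generative framework, the sketch below keeps the reference person's motion tokens fixed and lets the model fill in only the other person's masked tokens; the `tokenizer` and `generator` interfaces, and the mask id, are hypothetical stand-ins rather than our actual API.

```python
import torch

MASK_ID = -1  # hypothetical placeholder id for masked tokens

def generate_reaction(reference_motion, text, tokenizer, generator):
    """Sketch: fix the reference person's tokens and let a masked generative
    model predict the other person's tokens, optionally conditioned on text."""
    ref_tokens = tokenizer.encode(reference_motion)      # (T, J) discrete motion tokens
    gen_tokens = torch.full_like(ref_tokens, MASK_ID)    # fully masked second person
    # iterative masked decoding: only the masked person's tokens are predicted,
    # conditioned on the reference tokens and (optionally) the text prompt
    gen_tokens = generator.generate(gen_tokens, context=ref_tokens, text=text)
    return tokenizer.decode(gen_tokens)                  # back to joint positions
```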
While InterMask demonstrates strong capabilities in generating 3D human interactions, challenges arise in certain scenarios when the individuals are in close proximity or when the movements are rapid. Below, we present two such failure cases, with the output joint skeleton and the converted SMPL mesh. Even though the output joint skeleton is sufficiently accurate, the conversion to SMPL meshes introduces penetration and jerky movements. A potential solution to this problem is to incorporate the SMPL conversion process in training and employ geometric and interaction losses on the final meshes.
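As a rough sketch of what such mesh-level supervision could look like, the snippet below combines a temporal smoothness penalty with a simple interpenetration proxy on converted SMPL vertices; the margin, vertex subsampling, and loss forms are illustrative assumptions, not our implemented solution.

```python
import torch

def mesh_level_losses(verts_a, verts_b, margin=0.01):
    """verts_a, verts_b: (T, V, 3) SMPL mesh vertices of the two individuals
    (V subsampled in practice to keep the pairwise distances tractable)."""
    # temporal smoothness: penalize frame-to-frame vertex acceleration to reduce jerk
    accel_a = verts_a[2:] - 2 * verts_a[1:-1] + verts_a[:-2]
    accel_b = verts_b[2:] - 2 * verts_b[1:-1] + verts_b[:-2]
    smooth_loss = accel_a.pow(2).mean() + accel_b.pow(2).mean()

    # interaction/penetration proxy: penalize cross-person vertex pairs closer than the margin
    dists = torch.cdist(verts_a, verts_b)                # (T, V, V) pairwise distances per frame
    penetration_loss = torch.relu(margin - dists).mean()
    return smooth_loss, penetration_loss
```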
Another limitation is that our model sometimes interprets motions as dances without explicit prompting, likely due to implicit biases in the training dataset. As shown below, even though the text description does not mention dancing, the model still interprets the motion as a dance.