InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling

ICLR 2025 - Submission No: 12356

A. Fluidity in Generated Motions (Added after reviews)

Here, we present side-by-side comparisons of the joint-level keypoints generated by our model and their conversion to SMPL. The keypoints output by our model are smooth and fluid; the sudden movements and lack of fluidity arise during the SMPL conversion, because the conversion code we use processes each frame independently.
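One way to check this kind of claim numerically, rather than only visually, is to compare a jerk statistic (third time derivative of position) between the raw keypoints and the joints recovered after conversion. The sketch below is not part of InterMask: the function names, the 30 fps frame rate, the 22-joint layout, and the synthetic data are all assumptions made for illustration. It shows that independent per-frame perturbations raise the jerk of an otherwise smooth trajectory, and that a simple temporal moving average lowers it again.

```python
import numpy as np


def mean_jerk(joints, fps=30.0):
    """Mean magnitude of the third finite difference of joint positions.

    joints: array of shape (frames, num_joints, 3).
    fps: assumed frame rate (hypothetical default).
    """
    dt = 1.0 / fps
    jerk = np.diff(joints, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=-1).mean())


def moving_average(joints, window=5):
    """Smooth every joint coordinate over time with a centered moving average.

    window is assumed to be odd; edges are padded by repeating the end frames.
    """
    pad = window // 2
    padded = np.pad(joints, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="valid"), 0, padded
    )


# Synthetic stand-in data: a smooth 3-second motion and a copy with
# independent per-frame noise, mimicking a frame-by-frame conversion.
rng = np.random.default_rng(0)
t = np.arange(90) / 30.0
smooth = np.zeros((90, 22, 3))
smooth[:, :, 0] = np.sin(np.pi * t)[:, None]
jittery = smooth + rng.normal(scale=0.01, size=smooth.shape)

print("smooth   :", mean_jerk(smooth))
print("jittery  :", mean_jerk(jittery))
print("filtered :", mean_jerk(moving_average(jittery)))
```

On real data, `smooth` would be replaced by the generated keypoints and `jittery` by the joints regressed from the converted SMPL meshes; a large gap between the two values would be consistent with the jitter originating in the conversion step.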


B. Longer Results - 10 sec (Added after reviews)

We showcase InterMask's capability to generate longer interaction sequences of 10 seconds.


Two fencers are engaged in a sword fighting match


Two fighters are engaged in a boxing match


One person is running around the other in circles


Two dancers are practicing dance steps


C. Ablation Results on Inter-M Transformer (Added after reviews)

Here, we present side-by-side comparisons of results generated in the ablation study on the Inter-M Transformer. They demonstrate the specific contribution of each attention mechanism in different interaction scenarios, such as boxing, synchronized dancing, and sneaking up. The spatio-temporal attention module is crucial for handling complex poses and spatial awareness, the cross-attention mechanism ensures accurate and temporally synchronized reactions, and the self-attention module refines the overall quality.

In an intense boxing match, one is continuously punching while the other is defending and counterattacking

InterMask
w/o Spatio-Temporal Attention
w/o Cross Attention
w/o Self Attention
One person sneaks up on the other from behind

InterMask
w/o Spatio-Temporal Attention
w/o Cross Attention
w/o Self Attention
Both are performing synchronized dance moves

InterMask
w/o Spatio-Temporal Attention
w/o Cross Attention
w/o Self Attention

D. Complex / In-the-wild Text Instructions (Added after reviews)

We showcase InterMask's capability to generate interactions for more complex or less structured (in-the-wild) instructions.
For complex instructions involving multiple steps in progression and alternating actions between two individuals, our model performs well, as demonstrated in the first two examples.
However, for more out-of-distribution texts that the model did not encounter during training, it interprets cues as best as possible to generate plausible interactions. For instance, while the model does not fully understand "pointing a gun," it generates a sample where one person points at the other, who raises their hands. Similarly, in the "Goku vs. Vegeta" scenario, the model understands the context of a fight, producing karate-like poses but not specific moves like "kamehameha." For prompts like the "Fortnite Orange Justice dance," it generates a celebratory dance with two winners but does not replicate the specific moves.
These limitations highlight the need for future work, which could incorporate foundation models of language, motion, or multimodal representations, utilize additional single-person motion data, or expand interaction datasets, potentially with data sourced from internet videos.


One person picks up something from the floor and hands it to the other person. The other person drops it on the floor and picks it up again.


One person is sitting in a chair and waves to the other person, while the other person is running away. The first person suddenly gets up and starts chasing the other person.


One person is pointing a gun at the other person and takes a step towards them. The other person acts scared and raises their hands.


Goku and Vegeta face each other in an epic battle. Goku performs his signature Kamehameha and Vegeta performs his move Galick Gun.


Two players win in fortnite and perform the orange justice dance step


1. Interaction Generation Gallery

InterMask can generate high-quality 3D human interactions across diverse text inputs. Here, we show 15 distinct examples of generated interactions including everyday actions, dancing and fighting.

Everyday Actions


Two people are spinning around in clockwise direction


One person dashes towards the other


Both play rock paper scissors with their right hands


The two are blaming each other and having an intense argument


The first runs to their right and the other begins to chase them


One person tosses something to the other and the other catches it


Dance


Both are performing synchronized dance moves


They both swing their hands four times and finally raise their right feet


While slow dancing one takes a step with his right foot


Combat


One takes a step forward and strikes with right hand, the other tries to block and takes a step back


First person lifts right leg to strike, while other person responds by raising their right leg


One person steps forward with their right leg and raises both hands to fight. The other steps forward with their right leg


Two people move towards their right, they face each other and prepare for the next move


One person strikes the other with a sword and the other dodges


The other person strikes one with their right hand, and one blocks it with their left hand. Then they separate


2. Nuanced Descriptions

InterMask follows specific details in more nuanced text descriptions, such as the number of steps and body-relative directions.


One person takes five steps to get to the other person's back, who is sitting in a chair holding something in their hands

The two guys lower their arms and proceed to move forward and take 4 steps


One takes a step forward with the left foot, and another with the right foot, they reach out with the first person's left hand grabbing the other person's right arm and their other arms cross


One takes a step forwards with their right foot while the other takes a step towards right with their right foot


3. Diverse Generation

Our InterMask also maintains a certain level of diversity during generation. For each example below, we show two distinct generated samples side by side, from the same text description.

In an intense boxing match, one is continuously punching while the other is defending and counterattacking

Two people are waving their hands and performing a dance step together

The first person raises the right leg aggressively towards the second

Two fencers engage in a thrilling duel, their sabres clashing and sparking as they strive for victory


4. Comparison

We compare InterMask against InterGen, a strong diffusion-based baseline.

The first person is sitting on a chair, their hands resting in their lap, while the other person takes a step towards them

InterGen
InterMask (Ours)
Two people bow to each other

InterGen
InterMask (Ours)
One person sneaks up on the other from behind

InterGen
InterMask (Ours)
The first person raises the right leg aggressively towards the second

InterGen
InterMask (Ours)
Two friends take a consecutive step with the adjacent foot, then take 5 strides forward

InterGen
InterMask (Ours)
One person is sitting and waving their hands at the other person, while the other drifts away

InterGen
InterMask (Ours)

5. Application: Reaction Generation

We showcase InterMask's capability to perform the reaction generation task, where the motion of one individual is generated depending on the provided reference motion of the other, with and without text descriptions. The reference motion is shown in pink, and the generated motion is shown in blue.


These two take a step away from each other and stretch their arms


These two raise their left hands and extend them towards the left


One person approaches the other


One person takes 4 steps towards the other, while the other is sitting on a chair holding a piece of paper


without text description


without text description


6. Failure Cases

While InterMask demonstrates strong capabilities in generating 3D human interactions, challenges arise in certain scenarios when the individuals are in close proximity or when the movements are rapid. Below, we present two such failure cases, with the output joint skeleton and the converted SMPL mesh. Even though the output joint skeleton is sufficiently accurate, the conversion to SMPL meshes introduces penetration and jerky movements. A potential solution to this problem is to incorporate the SMPL conversion process in training and employ geometric and interaction losses on the final meshes.
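As a rough illustration of what a geometric interaction term could look like, the sketch below penalizes pairs of joints from the two people that come closer than a chosen distance. It is a simplification made for illustration, not the authors' method: the text above proposes losses on the final SMPL meshes, whereas this proxy treats every joint as a sphere of a hypothetical `radius` and works on joint positions only. The function name, the radius value, and the synthetic data are assumptions.

```python
import numpy as np


def proximity_penalty(joints_a, joints_b, radius=0.1):
    """Squared hinge penalty on inter-person joint pairs closer than 2 * radius.

    joints_a, joints_b: arrays of shape (frames, num_joints, 3), in metres.
    radius: assumed sphere radius around each joint (hypothetical value).
    """
    # Pairwise differences between every joint of A and every joint of B.
    diff = joints_a[:, :, None, :] - joints_b[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (frames, joints_a, joints_b)
    overlap = np.maximum(0.0, 2.0 * radius - dist)
    return float((overlap**2).mean())


# Synthetic stand-in data: person B is either far from A or coincides with A.
rng = np.random.default_rng(0)
person_a = rng.uniform(-0.5, 0.5, size=(10, 22, 3))
far_b = person_a + np.array([3.0, 0.0, 0.0])
overlapping_b = person_a.copy()

print("far apart   :", proximity_penalty(person_a, far_b))
print("overlapping :", proximity_penalty(person_a, overlapping_b))
```

A mesh-level version would replace the joint spheres with vertex or signed-distance queries on the converted meshes, which is where the penetration described above actually appears.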


First person is sitting in a chair, the second takes a step forward with their right foot.

Output Joint Skeleton
Front View | Side View
Converted SMPL Mesh
Front View | Side View


These two spin to face each other

Output Joint Skeleton
Front View | Side View
Converted SMPL Mesh
Front View | Side View


Another limitation is that our model sometimes interprets motions as dances without explicit prompting, likely due to implicit biases in the training dataset. As shown below, even though the text does not mention dancing, the model still interprets it as such.


The first takes a step with their left foot

Front View | Side View