<!DOCTYPE html>
<html data-wf-domain="" data-wf-page="596e65d120426e09785027f0" data-wf-site="596e65d120426e09785027eb" data-wf-status="1"
    class="w-mod-js wf-opensans-n3-active wf-opensans-n4-active wf-roboto-n4-active wf-opensans-i3-active wf-opensans-i4-active wf-opensans-n6-active wf-opensans-i6-active wf-opensans-n7-active wf-opensans-i7-active wf-opensans-n8-active wf-opensans-i8-active wf-roboto-n3-active wf-roboto-n5-active wf-active">

<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <title>InterMask</title>
    <meta content="width=device-width, initial-scale=1" name="viewport">
    <meta content="Webflow" name="generator">
    <link href="./files/supplemental.css" rel="stylesheet" type="text/css">
    <script src="./files/webfont.js" type="text/javascript"></script>
    <script type="text/javascript">
        WebFont.load({
            google: {
                families: ["Open Sans:300,300italic,400,400italic,600,600italic,700,700italic,800,800italic", "Roboto:300,regular,500"]
            }
        });
    </script>
    <script type="text/javascript">
        ! function (o, c) {
            var n = c.documentElement,
                t = " w-mod-";
            n.className += t + "js", ("ontouchstart" in o || o.DocumentTouch && c instanceof DocumentTouch) && (n.className += t + "touch")
        }(window, document);
    </script>
</head>


<body class="body">
    <div class="section">
        <div class="container-3 w-container">
            <h1 class="papertitle">InterMask: 3D Human Interaction Generation via Collaborative Masked Modelling</h1>
            <div class="text-block">ICLR 2025 - Submission No: 12356</div>
            <div class="text-block">Muhammad Gohar Javed<sup>1</sup>, Chuan Guo<sup>2</sup>, Li Cheng<sup>1</sup>, Xingyu Li<sup>1</sup> <br><br>
                <sup>1</sup>University of Alberta, <sup>2</sup>Snap Inc.
            </div>


        </div>
    </div>

    <div class="section-2">
        <div class="container w-container">
            <ul role="list" class="list">
                
                <li class="list-item">
                    <a href="#experiment_1">1. Interaction Generation Gallery</a>
                            <ul role="list">
                                <li>
                                    <a href="#experiment_1_1">Everyday Actions</a>
                                </li>
                                <li>
                                    <a href="#experiment_1_2">Dance</a>
                                </li>
                                <li>
                                    <a href="#experiment_1_3">Combat</a>
                                </li>
                            </ul>
                        </li>
                <li>
                    <a href="#experiment_2">2. Nuanced Descriptions</a>
                </li>
                <li>
                    <a href="#experiment_3">3. Diverse Generation</a>
                </li>
                <li class="list-item">
                    <a href="#experiment_4">4. Comparison</a>
                </li>
                <li>
                    <a href="#experiment_c">5. Ablation Results on Inter-M Transformer</a>
                </li>
                <li class="list-item">
                    <a href="#experiment_5">6. Application: Reaction Generation</a>
                </li>
                <li>
                    <a href="#experiment_d">7. Complex / In-the-wild Text Instructions</a>
                </li>
                <li>
                    <a href="#experiment_b">8. Longer Results - 10 sec</a>
                </li>
                <li class="list-item">
                    <a href="#experiment_6">9. Failure Case</a>
                </li>
                <li>
                    <a href="#experiment_a">10. Fluidity in Generated Motions</a>
                </li>

            </ul>
        </div>
    </div>

    <div>
    <div class="container-2 w-container">
            <div class="w-container">
                <h3 id="experiment_1" class="subexperimenttitle">1. Interaction Generation Gallery</h3>
                <p class="paragraph">
                    InterMask generates high-quality 3D human interactions from diverse text inputs. Here, we show 15 distinct examples of generated interactions, covering everyday actions, dance, and combat.<br><br>
                </p>
                <h4 id="experiment_1_1" class="subsubexperimenttitle"><center>Everyday Actions</center></h4>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">Two people are <strong>spinning around in clockwise direction</strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/everyday/animation1_27.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">One person <strong><span style="color:#47b1d5;">dashes</span></strong> towards <strong><span style="color:#c05252;">the other</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/everyday/animation0_16.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">Both play <strong> rock paper scissors</strong> with their <strong>right hands</strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/everyday/animation0_40.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>

                <div class="videoresult w-row">
                    
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">The two are blaming each other and <strong>having an intense argument</strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/everyday/infer_3.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">The first <strong><span style="color:#47b1d5;">runs to their right</span></strong> and the other begins to <strong><span style="color:#c05252;">chase them</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/everyday/animation0_46.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">One person <strong><span style="color:#c05252;">tosses something</span></strong> to the other and the other <strong><span style="color:#47b1d5;">catches it</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/everyday/animation2_33.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                
                
                <hr>
                <h4 id="experiment_1_2" class="subsubexperimenttitle"><center>Dance</center></h4>
                
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">Both are performing <strong>synchronized dance moves</strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/dance/infer_9.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">They both <strong>swing their hands four times</strong> and finally <strong>raise their right feet</strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/dance/animation1_08.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">While <strong>slow dancing</strong> one <strong><span style="color:#47b1d5;">takes a step with his right foot</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/dance/animation0_07.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>

                <hr>
                <h4 id="experiment_1_3" class="subsubexperimenttitle"><center>Combat</center></h4>
                
                
                <!-- <hr> -->
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">One <strong><span style="color:#47b1d5;">takes a step forward and strikes with right hand</span></strong>, the other <strong><span style="color:#c05252;">tries to block and takes a step back</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/fight/animation1_42.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">First person <strong><span style="color:#47b1d5;">lifts right leg to strike</span></strong>, while other person <strong><span style="color:#c05252;">responds by raising their right leg</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/fight/animation0_24.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">One person <strong><span style="color:#47b1d5;">steps forward with their right leg and raises both hands to fight</span></strong>. The other <strong><span style="color:#c05252;">steps forward with their right leg</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/fight/animation0_12.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>

                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">Two people <strong>move towards their right</strong>, they <strong> face each other</strong> and prepare for the next move</span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/fight/animation2_01.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">One person <strong><span style="color:#47b1d5;">strikes the other</span></strong> with a sword and the other <strong><span style="color:#c05252;">dodges</span></strong></span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/fight/animation2_15.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-4">
                        <br>
                        <div class="three-line-container">
                            <span class="three-line-text">The other person <strong><span style="color:#c05252;">strikes one with their right hand</span></strong>, and one <strong><span style="color:#47b1d5;">blocks it with their left hand</span></strong>. Then they separate</span>
                          </div>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/interhuman_gallery/fight/animation2_06.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>

                
                <hr>

                
                
                
                
                <!-- <hr> -->
            </div>

            <div class="w-container">
                <h3 id="experiment_2" class="subexperimenttitle">2. Nuanced Descriptions</h3>
                <p class="paragraph">
                    InterMask follows specific details in nuanced text descriptions, such as the number of steps and body-relative directions. <br><br>
                </p>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 12pt;">One person <strong><span style="color:#c05252;">takes <span style="color:red;">five</span> steps</span></strong> to get to the other person's back, who is <strong><span style="color:#47b1d5;">sitting in a chair holding something in their hands</span></strong></span> </center>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/nuanced/animation2_37.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 12pt;">The two guys lower their arms and proceed to move forward and <strong>take <span style="color:red;">4</span> steps</strong></span> </center>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/nuanced/animation0_43.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 12pt;">One takes a <strong><span style="color:#47b1d5;">step forward with the <span style="color:red;">left</span> foot</span></strong>, and <strong><span style="color:#c05252;">another with the <span style="color:red;">right</span> foot</span></strong>, they reach out with <strong><span style="color:#47b1d5;">the first person's <span style="color:red;">left</span> hand</span></strong> grabbing the <strong><span style="color:#c05252;">other person's <span style="color:red;">right</span> arm</span></strong> and their other arms crosses</span> </center>
                        <br>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/nuanced/animation2_02.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 12pt;">One takes a <strong><span style="color:#c05252;">step forwards with their <span style="color:red;">right</span> foot</span></strong> while the other <strong><span style="color:#47b1d5;">takes a step towards right with their <span style="color:red;">right</span> foot</span></strong></span> </center>
                        <br>
                        <video width="100%" height="100%" source="" src="./demo_videos/nuanced/animation0_09.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                
                <hr>

            </div>


            <div class="w-container">
                <h3 id="experiment_3" class="subexperimenttitle">3. Diverse Generation</h3>
                <p class="paragraph">
                    InterMask also maintains diversity during generation. For each example below, we show two distinct samples generated from the same text description, side by side.<br><br>
                </p>
                <center><span style="font-size: 12pt;">In an intense boxing match, one is <strong>continuously punching</strong> while the other is <strong>defending and counterattacking</strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_0_0.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_0_1.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">Two people are <strong>waving their hands and performing a dance step</strong> together</span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_8_0.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_8_1.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">The first person <strong>raises the right leg aggressively</strong> towards the second</span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_10_0.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_10_1.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">Two fencers <strong>engage in a thrilling duel</strong>, their sabres clashing and sparking as they strive for victory</span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_2_0.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <video width="100%" height="100%" source="" src="./demo_videos/diversity/infer_2_1.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <hr>

            </div>
        </div>
    </div>

    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_4" class="experimenttitle">4. Comparison</h3>
            </div>
            <div class="w-container">
                <p class="paragraph">
                    We compare InterMask against a strong diffusion-based baseline, <a href="https://tr3e.github.io/intergen-page/">InterGen</a>.
                </p>
                
                <center><span style="font-size: 12pt;">The first person is <strong><span style="color:#47b1d5;">sitting on a chair</span></strong>, their hands resting in their lap, while the other person <strong><span style="color:#c05252;">takes a step towards them</span></strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;">InterGen</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation2_13_intergen.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;"><strong>InterMask</strong> (Ours)</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation2_13.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">Two people <strong>bow</strong> to each other</span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;">InterGen</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/infer_5_intergen.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;"><strong>InterMask</strong> (Ours)</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/infer_5.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">One person <strong><span style="color:#47b1d5;">sneaks up</span></strong> on <strong><span style="color:#c05252;">the other</span></strong> from behind</span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;">InterGen</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation0_19_intergen.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;"><strong>InterMask</strong> (Ours)</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation0_19.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">The first person <strong><span style="color:#47b1d5;">raises the right leg aggressively</span></strong> towards the <strong><span style="color:#c05252;">second</span></strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;">InterGen</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/infer_10_intergen.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;"><strong>InterMask</strong> (Ours)</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/infer_10.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->
                
                <center><span style="font-size: 12pt;">Two friends take a consecutive step with the adjacent foot, then <strong>take 5 strides forward</strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;">InterGen</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation0_32_intergen.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;"><strong>InterMask</strong> (Ours)</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation0_32.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <!-- <hr> -->
                
                <center><span style="font-size: 12pt;">One person is <strong><span style="color:#c05252;">sitting and waving their hands</span></strong> at the other person, while the other <strong><span style="color:#47b1d5;">drifts away</span></strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;">InterGen</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation2_19_intergen.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 11pt;"><strong>InterMask</strong> (Ours)</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/compare/animation2_19.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                </div>
                <hr>

                
            </div>

        </div>
    </div>


    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_c" class="experimenttitle">5. Ablation Results on Inter-M Transformer</h3>
            </div>
            <p class="paragraph">
                Here, we present side-by-side comparisons of results generated in the ablation study on the Inter-M Transformer. They demonstrate the specific contribution of each attention mechanism in different interaction scenarios, such as boxing, synchronized dancing, and sneaking up. The spatio-temporal attention module is crucial for handling complex poses and spatial awareness; the cross-attention mechanism ensures accurate and temporally synchronized reactions; and the self-attention module refines the overall quality.
            </p>
            
            <center><span style="font-size: 12pt;">In an intense boxing match, <strong><span style="color:#c05252;">one is continuously punching</span></strong> while <strong><span style="color:#47b1d5;">the other</span></strong> is defending and counterattacking</span> </center>
            <br>
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>InterMask</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/boxing.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Spatio-Temporal Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/boxing_spatemp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Cross Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/boxing_cross.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Self Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/boxing_self.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>
            
            <center><span style="font-size: 12pt;">One person <strong><span style="color:#47b1d5;">sneaks up</span></strong> on <strong><span style="color:#c05252;">the other</span></strong> from behind</span> </center>
            <br>
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>InterMask</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/sneak.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Spatio-Temporal Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/sneak_spatemp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Cross Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/sneak_cross.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Self Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/sneak_self.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>
            
            <center><span style="font-size: 12pt;">Both are performing <strong>synchronized dance moves</strong> </span> </center>
            <br>
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>InterMask</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/dance.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Spatio-Temporal Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/dance_spatemp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Cross Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/dance_cross.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <span class="three-line-text"><center><em>w/o Self Attention</em></center></span>
                    <video width="100%" height="100%" source="" src="./demo_videos/ablation/dance_self.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>

            <hr>
        </div>
    </div>

    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_5" class="experimenttitle">6. Application: Reaction Generation</h3>
            </div>
            <p class="paragraph">
                We showcase InterMask's capability to perform the <strong>reaction generation</strong> task, where the motion of one individual is generated depending on the provided reference motion of the other, <strong>with and without text descriptions</strong>. The <strong>reference</strong> motion is shown in <span style="color:#c05252;">pink</span>, and the <strong>generated</strong> motion is shown in <span style="color:#47b1d5;">blue</span>.
            </p>
            
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">These two take a <strong>step away from each other and stretch their arms</strong></span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/reaction/30.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">These two <strong>raise their left hands</strong> and extend them towards the left</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/reaction/42.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">One person <strong><span style="color:#47b1d5;">approaches</span></strong> <strong><span style="color:#c05252;">the other</span></strong></span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/reaction/48.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>

            <div class="videoresult w-row">
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">One person <strong><span style="color:#c05252;">takes 4 steps towards the other</span></strong>, while the other is <strong><span style="color:#47b1d5;">sitting on a chair holding a piece of paper</span></strong></span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/reaction/37.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text"><i>without text description</i></span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/reaction/02.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text"><i>without text description</i></span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/reaction/16.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>


            <hr>
        </div>
    </div>

    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_d" class="experimenttitle">7. Complex / In-the-wild Text Instructions</h3>
            </div>
            <p class="paragraph">
                We showcase InterMask's capability to generate interactions for more complex or less structured (in-the-wild) instructions. 
                <br>
                For complex instructions involving multiple steps in progression and alternating actions between two individuals, our model performs well, as demonstrated in the first two examples. 
                <br>
                For out-of-distribution texts unlike anything the model encountered during training, however, it interprets the cues as best it can to generate plausible interactions. For instance, while the model does not fully understand "pointing a gun," it generates a sample where one person points at the other, who raises their hands. Similarly, in the "Goku vs. Vegeta" scenario, the model understands the context of a fight and produces karate-like poses, but not specific moves like "Kamehameha." For the "Fortnite Orange Justice dance" prompt, it generates a celebratory dance with two winners but does not replicate the specific moves.
                <br>
                These limitations highlight the need for future work, which could incorporate foundation models of language, motion, or multimodal representations, utilize additional single-person motion data, or expand interaction datasets, potentially with data sourced from internet videos.
            </p>
            
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">One person picks up something from the floor and hands it to the other person. The other person drops it on the floor and picks it up again.</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/wild/handover.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">One person is sitting in a chair and waves to the other person, while the other person is running away. The first person suddenly gets up and starts chasing the other person.</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/wild/sitting_running.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">One person is pointing a gun at the other person and takes a step towards them. The other person acts scared and raises their hands.</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/wild/gun.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>

            <div class="videoresult w-row">
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">Goku and vegeta face each other in an epic battle. Goku performs his signature Kamehameha and vegeta performs his move Galick Gun.</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/wild/vegeta.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">Two players win in fortnite and perform the orange justice dance step</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/wild/fortnight.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>


            <hr>
        </div>
    </div>

    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_b" class="experimenttitle">8. Longer Results - 10 sec </h3>
            </div>
            <p class="paragraph">
                We showcase InterMask's capability to generate longer interaction sequences of 10 seconds.
            </p>
            
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">Two fencers are engaged in a sword fighting match</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/long/3_fencing.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">Two fighters are engaged in a boxing match</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/long/2_boxing.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">One person is running around the other in circles</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/long/5_circle.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>

            <div class="videoresult w-row">
                <div class="w-col w-col w-col-4">
                </div>
                <div class="w-col w-col w-col-4">
                    <br>
                    <div class="three-line-container">
                        <span class="three-line-text">Two dancers are practicing dance steps</span>
                      </div>
                    <br>
                    <video width="100%" height="100%" source="" src="./demo_videos/long/4_dance.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>


            <hr>
        </div>
    </div>

    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_6" class="experimenttitle">9. Failure Cases</h3>
            </div>
            <p class="paragraph">
                While InterMask demonstrates strong capabilities in generating 3D human interactions, challenges arise in certain scenarios, such as when the individuals are in close proximity or when the movements are rapid. Below, we present two such failure cases, showing both the output joint skeleton and the converted SMPL mesh. Even though the output joint skeleton is sufficiently accurate, the conversion to SMPL meshes introduces penetration and jerky movements. A potential solution to this problem is to incorporate the SMPL conversion process into training and employ geometric and interaction losses on the final meshes.<br><br><br>
            </p>
            <center><span style="font-size: 12pt;">First person is <strong><span style="color:#c05252;">sitting in a chair</span></strong>, the second <strong><span style="color:#47b1d5;">takes a step forward with their right foot</span></strong>.</span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 10pt;">Output Joint Skeleton</span> </center>
                        <center><span style="font-size: 8pt;">Front View &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  Side View</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/failure/animation2_39_kp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>

                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 10pt;">Converted SMPL Mesh</span> </center>
                        <center><span style="font-size: 8pt;">Front View &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Side View <br> <br> <br></span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/failure/animation2_39.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    
                </div>
                <!-- <hr> -->

                <center><span style="font-size: 12pt;">These two <strong>spin to face each other</strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 10pt;">Output Joint Skeleton</span> </center>
                        <center><span style="font-size: 8pt;">Front View &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  Side View</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/failure/animation2_43_kp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>

                    <div class="w-col w-col w-col-6">
                        <center><span style="font-size: 10pt;">Converted SMPL Mesh</span> </center>
                        <center><span style="font-size: 8pt;">Front View &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Side View <br> <br> <br></span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/failure/animation2_43.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    
                </div>

                <p class="paragraph">
                    Another limitation is that our model sometimes interprets motions as dances without explicit prompting, likely due to implicit
                    biases in the training dataset. As shown below, even though the prompt does not mention dancing, the model still interprets it as such.<br><br><br>
                </p>
                
                <center><span style="font-size: 12pt;">The first takes a <strong><span style="color:#47b1d5;">step with their left foot</span></strong></span> </center>
                <br>
                <div class="videoresult w-row">
                    <div class="w-col w-col w-col-12">
                        <center><span style="font-size: 8pt;">Front View &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;   Side View</span> </center>
                        <video width="100%" height="100%" source="" src="./demo_videos/failure/best_fid_dec_ft_ts20_cs2_topkr0.9_14_gen_mesh.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                    </div>
                    
                </div>
            <hr>
        </div>
    </div>

    <div>
        <div class="container-2 w-container">
            <div class="container-2 w-container">
                <h3 id="experiment_a" class="experimenttitle">10. Fluidity in Generated Motions</h3>
            </div>
            <p class="paragraph">
                Here, we present side-by-side comparisons of the joint-level keypoints generated by our model and their conversion to SMPL. It can be seen that our model outputs smooth and fluid motions; the observed sudden movements and lack of fluidity arise during the SMPL conversion, because the conversion code we use processes each frame independently.
            </p>
            
            <div class="videoresult w-row">
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/dash_kp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/dash_smpl.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/raise_kp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/raise_smpl.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>

            <div class="videoresult w-row">
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/dance_kp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/dance_smpl.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/walk_behind_kp.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
                <div class="w-col w-col w-col-3">
                    <video width="100%" height="100%" source="" src="./demo_videos/smooth/walk_behind_smpl.mp4" type="video/mp4" loop="true" autoplay="autoplay" controls muted></video>
                </div>
            </div>

            <hr>
        </div>
    </div>

    <script src="./files/jquery-3.4.1.min.220afd743d.js" type="text/javascript" integrity="sha256-CSXorXvZcTkaix6Yvo6HppcZGetbYMGWSFlBw8HfCJo=" crossorigin="anonymous"></script>
    <script src="./files/webflow.3cd0ca831.js" type="text/javascript"></script>
</body>