<script type="module" src="https://unpkg.com/@google/model-viewer/dist/model-viewer.min.js"></script>
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta name="robots" content="noindex">
  <link rel="StyleSheet" href="./files/style.css" type="text/css" media="all">

  <title>Agent Real-to-Sim: Learning Interactive Behavior from Casual Videos
  </title>

  <style type="text/css">
    body {
      font-family: Times;
      background-color: #f2f2f2;
      font-size: 15px;
    }

    .content {
      width: 890px;
      padding: 25px 25px;
      margin: 25px auto;
      background-color: #fff;
      border-radius: 20px;
    }

    .description {
      font-family: "Times";
      white-space: pre;
      text-align: left;
    }

    .content-title {
      background-color: inherit;
      margin-bottom: 0;
      padding-bottom: 0;
    }

    a,
    a:visited {
      text-decoration: none;
      color: blue;
    }

    .anchor {
      color: inherit;
    }

    #authors {
      text-align: center;
    }

    #conference {
      text-align: center;
      font-style: italic;
    }

    #authors a {
      margin: 0 10px;
    }

    h1 {
      text-align: center;
      font-family: Times;
      font-size: 35px;
    }

    h2 {
      font-family: Times;
      font-size: 25px;
      padding: 0;
      margin: 10px;
    }

    h3 {
      font-family: Times;
      font-size: 20px;
      padding: 0;
      margin: 10px;
    }

    p {
      font-family: Times;
      line-height: 130%;
      margin: 10px;
    }

    big {
      font-family: Times;
      font-size: 20px;
    }

    li {
      margin: 10px 0;
    }

    .samples {
      float: left;
      width: 50%;
      text-align: center;
    }

    .cond {
      float: left;
      margin: 0 40px;
    }

    .cond-container {
      width: 800px;
      margin: 0 auto;
      text-align: center;
    }

    #vidalign {
      display: block;
      margin: 0px;
      padding: 0px;
      position: relative;
      top: 90px;
      height: auto;
      max-width: none;
      overflow-y: hidden;
      overflow-x: auto;
      word-wrap: normal;
      white-space: nowrap;
    }

    /* Add a semi-transparent black background to the top navigation */
    .topnav {
      background-color: rgba(0, 0, 0, 0.2);
      z-index: 1;
      overflow: hidden;
      position: fixed;
      top: 0;
      /* Position the navbar at the top of the page */
      width: 100%;
      /* Full width */
    }

    /* Style the links inside the navigation bar */
    .topnav a {
      float: left;
      color: #333;
      text-align: center;
      padding: 14px 16px;
      text-decoration: none;
      font-size: 17px;
    }

    /* Change the color of links on hover */
    .topnav a:hover {
      background-color: #ddd;
      color: black;
    }

    /* Add a color to the active/current link */
    .topnav a.active {
      background-color: #04AA6D;
      color: white;
    }
  </style>

  <style>
    model-viewer {
      width: 300px;
      height: 300px;
    }
  </style>

</head>

<body>

<div class="topnav">
  <a class="active" href="#top">Top</a>
  <a href="#4D Reconstruction Results">4D Reconstruction Results</a>
  <a href="#Behavior Simulation Results">Behavior Simulation Results</a>
  <a href="#Visualization of Denoising Process">Visualization of the Denoising Process</a>
</div>

<div id="top" class="content content-title" style="text-align: center;">
  <h1>Agent Real-to-Sim: Learning Interactive Behavior from Casual Videos
  </h1>
  <big style="color:grey;"> In Submission to NeurIPS 2024 </big>
</div>


<div class="content">
  <figure style="font-family: Times; font-weight: normal; margin: 0px; padding: 0px; border: 0px; text-align: left">
    <img src="materials/teaser.png" style="width:100%;">
    <br>
    <br>
    <figcaption> TL;DR: We build a simulatable agent in its familiar environment in 3D, given casual videos collected
      over a long time horizon (1 month).
      <br>
      <br>
      We aim to answer the following question: can we simulate the behavior of an agent by learning from casually captured videos of the same agent recorded over a long period of time (e.g., a month)? A) We first reconstruct the videos in 4D (3D and time), which includes the scene, the trajectory of the agent, and the trajectory of the observer (i.e., the camera held in the observer's hand). These individual 4D reconstructions are registered across time, resulting in a complete 4D reconstruction. B) We then learn a representation of the agent that allows for interactive behavior simulation. The behavior model explicitly reasons about goals, paths, and full-body movements, conditioned on the agent's ego-perception and past trajectory. This agent representation allows us to simulate novel scenarios through conditioning. For example, conditioned on different observer trajectories, the cat agent chooses to walk to the carpet, stay still while quivering its tail, or hide under the tray stand.
    </figcaption>
  </figure>
</div>


<div id="4D Reconstruction Results" class="content">
  <div style="float: right; width:70px; margin-top: 0px; margin-bottom: 25px">
  </div>
  <h2>Results: 4D Reconstruction</h2>
  <!-- <center> -->
  <table align=center width=810px>
    <tr>
      <td width=100%>
        <video playsinline controls autoplay muted loop width="100%">
          <source src="./materials/cat/export_0001/compressed_render-shape-compose-concat.mp4" type="video/mp4">
        </video>
    </tr>
  </table>
  <table align=center width=810px>
    <tr>
      <td>
        We show video results corresponding to Fig. 2. Left: reconstructions from the camera viewpoint; Right:
        reconstructions of the environment, agent, and user camera (shown as the moving coordinate axes) from a top-down
        viewpoint. You may find more results on the cat dataset <a href='materials/cat.html'>[here]</a> (26 videos).
      </td>
    </tr>
  </table>
  <hr>
  <!-- </center> -->
  <table align=center width=810px>
    <tr>
      <td width=100%>
        <video playsinline controls autoplay muted loop width="33%">
          <source src="./materials/bunny/export_0001/render-shape-compose-concat.mp4" type="video/mp4">
        </video>
        <video playsinline controls autoplay muted loop width="33%">
          <source src="./materials/dog/export_0001/compressed_render-shape-compose-concat.mp4" type="video/mp4">
        </video>
        <video playsinline controls autoplay muted loop width="33%">
          <source src="./materials/human/export_0001/compressed_render-shape-compose-concat.mp4" type="video/mp4">
        </video>
    </tr>
  </table>
  <table align=center width=810px>
    <tr>
      <td>
        We show video results of reconstructing a bunny, a dog, and a human agent. You may find more results on the <a href='materials/bunny/index.html'>[bunny]</a> dataset,
        the <a href='materials/dog/index.html'>[dog]</a> dataset, and the <a href='materials/human/index.html'>[human]</a> dataset.
      </td>
    </tr>
  </table>
</div>

<div id="Behavior Simulation Results" class="content">
  <div style="float: right; width:70px; margin-top: 0px; margin-bottom: 25px">
  </div>
  <h2>Results: Behavior Simulation</h2>
  <table align=center width=810px>
    <tr>
      <td>
        <strong>User conditioning</strong>: We can simulate the behavior of an agent conditioned on the user's
        location (represented by the coordinate axes).
      </td>
    </tr>
  </table>
  <table align=center width=810px>
    <tr>
      <td width=100%>
        <video playsinline controls autoplay muted loop width="100%">
          <source src="./materials/demo/user-control-a.mp4" type="video/mp4">
        </video>
    </tr>
  </table>
  <table align=center width=810px>
    <tr>
      <td>
        <strong>Goal conditioning</strong>: We can control the motion of an agent by manually setting the goals
        (represented by the blue spheres).
      </td>
    </tr>
  </table>
  <table align=center width=810px>
    <tr>
      <td width=100%>
        <video playsinline controls autoplay muted loop width="100%">
          <source src="./materials/demo/goal-cond-a.mp4" type="video/mp4">
        </video>
    </tr>
  </table>
  <hr>
  <table align=center width=810px>
    <tr>
      <td>
        <strong>Autoregressive generation</strong>: We can simulate the behavior of the agent over a long time horizon (more than 30 s, while being trained on only 5.6 s)
        by conditioning on the environment and the past trajectory. Autoregressive generation results for all agents can be found <a href='materials/generation.html'>[here]</a>.
      </td>
    </tr>
  </table>
  <table align=center width=810px>
    <tr>
      <td width=100%>
        <video playsinline controls autoplay muted loop width="50%">
          <source src="./materials/generation/compressed_cat-following.mp4" type="video/mp4">
        </video>
    </tr>
  </table>
</div>



<div id="Visualization of Denoising Process" class="content">
  <div style="float: right; width:70px; margin-top: 0px; margin-bottom: 25px">
  </div>
  <h2>Visualizations: Hierarchical Motion Denoising (Fig. 3)</h2>
  <!-- <center> -->
  <table align=center width=810px>
    <tr>
      <td>
        <h3>Goal denoising (w/ different conditioning signals)</h3>
        Scenario: Exploring a room
        <table align=center width=810px>
          <td width=33%>
            <video playsinline controls autoplay muted width="100%">
              <source src="./materials/vis/compressed_goal-0-animation-e.mp4" type="video/mp4">
            </video>
            <video playsinline controls autoplay muted width="100%">
              <source src="./materials/vis/compressed_goal-0-rendering-e.mp4" type="video/mp4">
            </video>
            Conditioned on environment.<br><br>
          <td width=33%>
            <video playsinline controls autoplay muted width="100%">
              <source src="./materials/vis/compressed_goal-0-animation-ep.mp4" type="video/mp4">
            </video>
            <video playsinline controls autoplay muted width="100%">
              <source src="./materials/vis/compressed_goal-0-rendering-ep.mp4" type="video/mp4">
            </video>
            Conditioned on environment and past trajectory.
          <td width=33%>
            <video playsinline controls autoplay muted width="100%">
              <source src="./materials/vis/compressed_goal-0-animation-epu.mp4" type="video/mp4">
            </video>
            <video playsinline controls autoplay muted width="100%">
              <source src="./materials/vis/compressed_goal-0-rendering-epu.mp4" type="video/mp4">
            </video>
            Conditioned on environment, past trajectory, and user trajectory.
        </table>
      </td>
    </tr>
  </table>
  <hr>
  <table align=center width=810px>
    <tr>
      <td>
        <h3>Path denoising (w/ different conditioning signals)</h3>
        Scenario: Jumping off the sofa
        <table align=center width=810px>
          <tr>
            <td width=49%>
              <video playsinline controls autoplay muted width="100%">
                <source src="./materials/vis/compressed_wp-0-animation-noe.mp4" type="video/mp4">
              </video>
              <video playsinline controls autoplay muted width="100%">
                <source src="./materials/vis/compressed_wp-0-rendering-noe.mp4" type="video/mp4">
              </video>
              No environment conditioning.
            <td width=49%>
              <video playsinline controls autoplay muted width="100%">
                <source src="./materials/vis/compressed_wp-0-animation-e.mp4" type="video/mp4">
              </video>
              <video playsinline controls autoplay muted width="100%">
                <source src="./materials/vis/compressed_wp-0-rendering-e.mp4" type="video/mp4">
              </video>
              Environment conditioning.
          </tr>
        </table>
      </td>
    </tr>
  </table>

  <hr>
  <table align=center width=810px>
    <tr>
      <td>
        <h3>Body motion denoising</h3>
        Scenario: Following a path
        <table align=center width=810px>
          <tr>
            <td width=100%>
              <video playsinline controls autoplay muted width="58%">
                <source src="./materials/vis/compressed_joints-0-animation.mp4" type="video/mp4">
              </video>
              <video playsinline controls autoplay muted width="41%">
                <source src="./materials/vis/compressed_joints-0-rendering.mp4" type="video/mp4">
              </video>
          </tr>
        </table>
      </td>
    </tr>
  </table>

</div>
</body>

</html>
