<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="description"
          content="Hierarchical Planning with Foundation Models">

    <title>Hierarchical Planning with Foundation Models</title>
    <!-- Bootstrap core CSS -->
    <!--link href="bootstrap.min.css" rel="stylesheet"-->
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css"
          integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">

    <!-- Custom styles for this template -->
    <link href="offcanvas.css" rel="stylesheet">
    <!--    <link rel="icon" href="img/favicon.gif" type="image/gif">-->
</head>

<body>
<div class="jumbotron jumbotron-fluid">
    <div class="container">
        <h2>Hierarchical Planning with Foundation Models</h2>
    </div>
</div>

<div class="container">
    <div class="section">
        <p>
            To make effective decisions in novel environments with long-horizon goals, an agent must reason hierarchically across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose <i>Hierarchical Planning with Foundation Models</i> (<TT>HiP</TT>), a framework that leverages different modalities of knowledge to capture the information needed at each level of decision-making. We use a large language model to construct symbolic plans, which are grounded in the environment through a large video diffusion model. The generated video plans are then grounded in visual-motor control through an inverse dynamics model that infers actions from the generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via <i>iterative refinement</i>. We illustrate the efficacy and adaptability of our approach on two long-horizon table-top manipulation tasks.
        </p>
    </div>
    
    <div class="list-group">
        <center>
        <img src="img/teaser_4.jpg" alt="HiP teaser figure" style="width:100%;">
        </center>
    </div>

    <br>
    <br>
    <div class="section">
        <h2>Paint Block Results</h2>
        <hr>
        <p class="text-center">
            Successful execution trajectories of HiP on novel long-horizon tasks in the paint-block environment.
        </p>
        <br>
        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/paint_block_1.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Place purple block left of yellow block and cyan block right of yellow block</p>
                </div>
            </div> 
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/paint_block_2.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Stack red block on top of brown block and place yellow block to the left of the stack</p>
                </div>
            </div> 
        </div>
        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/paint_block_3.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Stack brown block on top of pink block and place cyan block to the left of the stack</p>
                </div>
            </div> 
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/paint_block_4.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Stack orange block on top of red block and place purple block to the right of the stack</p>
                </div>
            </div> 
        </div>
    </div>

    <br>
    <br>
    <div class="section">
        <h2>Object Arrange Results</h2>
        <hr>
        <p class="text-center">
            Successful execution trajectories of HiP on novel long-horizon tasks in the object-arrange environment.
        </p>
        <br>
        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/object_arrange_1.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Pack spiderman figure, frypan, nintendo 3ds, red and white striped towel in brown box</p>
                </div>
            </div> 
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/object_arrange_2.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Pack butterfinger chocolate, porcelain salad plate, porcelain spoon, green and white striped towel in brown box</p>
                </div>
            </div> 
        </div>
        <div class="row align-items-center">
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/object_arrange_3.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Pack spiderman figure, porcelain salad plate, nintendo cartridge, hammer in brown box</p>
                </div>
            </div> 
            <div class="col justify-content-center text-center">
                <video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
                    <source src="img/object_arrange_4.mp4" type="video/mp4">
                </video>
                <div class="overlay">
                    <p><b>Goal:</b> Pack crayon box, ball puzzle, hammer, red and white striped towel in brown box</p>
                </div>
            </div> 
        </div>
    </div>

</div>


<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"
        integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj"
        crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js"
        integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo"
        crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js"
        integrity="sha384-OgVRvuATP1z7JjHLkuOU7Xw704+h835Lr+6QL9UvYjZE3Ipu6Tp75j7Bh/kR0JKI"
        crossorigin="anonymous"></script>

</body>
</html>
