<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <style>
        .container {
            display: grid;
            grid-template-columns: repeat(7, 1fr);
            grid-gap: 10px;
            text-align: center;
        }

        .item {
            display: flex;
            flex-direction: column;
            align-items: center; /* 修改这行 */
        }

        .item img, .item video {
            max-width: 100%;
            height: auto;
        }

        .header {
            grid-column: span 7;
            text-align: center;
            font-size: 24px;
            margin-bottom: 20px;
        }

        .start-goal {
            display: flex;
            flex-direction: column; /* 修改这行 */
            justify-content: space-between;
            align-items: center;
            grid-row: span 2;
            display: block;

        }

        .start-goal img {
            margin: 0; /* 确保没有额外的边距 */
        }
    </style>
    <link rel="stylesheet" href="main.css">
    <link rel="icon" type="image/x-icon" href="assets/favicon.png">
    <title>Mani-WM: An Interactive World Model for Real-Robot Manipulation</title>
</head>
<body>
    <div id="video_grid">
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/short/rt1.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Short Trajectory<br>RT-1<br>Prediction</div>
                <div class="overlay1 right">Short Trajectory<br>RT-1<br>Ground-truth</div>
            </div>
        </div>
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/long/rt1.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Long Trajectory<br>RT-1<br>Prediction</div>
                <div class="overlay1 right">Long Trajectory<br>RT-1<br>Ground-truth</div>
            </div>
        </div>

        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/short/bridge.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Short Trajectory<br>Bridge<br>Prediction</div>
                <div class="overlay1 right">Short Trajectory<br>Bridge<br>Ground-truth</div>
            </div>
        </div>
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/long/bridge.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Long Trajectory<br>Bridge<br>Prediction</div>
                <div class="overlay1 right">Long Trajectory<br>Bridge<br>Ground-truth</div>
            </div>
        </div>
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/short/languagetable.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Short Trajectory<br>Language-Table<br>Prediction</div>
                <div class="overlay1 right">Short Trajectory<br>Language-Table<br>Ground-truth</div>
            </div>
        </div>
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/long/languagetable.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Long Trajectory<br>Language-Table<br>Prediction</div>
                <div class="overlay1 right">Long Trajectory<br>Language-Table<br>Ground-truth</div>
            </div>
        </div>
       
    </div>
    
<div id="title_slide">
    <div class="title_left">
        <h1 style="text-align: center;">Mani-WM: An Interactive World Model for Real-Robot Manipulation</h1>
        <div class="author-container-1">
            <div class="grid-item">ICLR 2025 Submission #6028</div>
        </div>

        <br>

        <div id="abstract" class="grid-container">
            <p>
                Scalable robot learning in the real world is limited by the cost and safety issues
                of real robots. In addition, rolling out robot trajectories in the real world can
                be time-consuming and labor-intensive. In this paper, we propose to learn an
                interactive world model for robot manipulation as an alternative. We present a
                novel method, Mani-WM, which leverages the power of generative models to
                generate realistic videos of a robot arm executing a given action trajectory, starting
                from an initial given frame. Mani-WM employs a novel frame-level conditioning
                technique to ensure precise alignment between actions and video frames and
                leverages a diffusion transformer for high-quality video generation. To validate the
                effectiveness of Mani-WM, we perform extensive experiments on four challenging
                real-robot datasets. Results show that Mani-WM outperforms all the comparing
                baseline methods and is more preferable in human evaluations. We further showcase
                the flexible action controllability of Mani-WM by controlling the virtual robots in
                datasets with trajectories 1) predicted by an autonomous policy and 2) collected by
                a keyboard or VR controller. Finally, we combine Mani-WM with model-based
                planning to showcase its usefulness on real-robot manipulation tasks. We hope that
                Mani-WM can serve as an effective and scalable approach to enhance robot learning
                in the real world.
            </p>
        </div>
    </div>
</div>
<hr class="rounded">
<div id="overview">
    
    <h1 style="text-align: center;">Video Generation as World Model</h1>
    <p>
        We create an interactive real-robot manipulation world model that can simulate robot trajectories in a way that is accurate and almost visually indistinguishable from the real world.
        With such a world model, agents can interactively control virtual robots to interact with diverse objects in various scenes and perform
        model-based planning by imagining the outcomes of different proposed candidate trajectories.
    </p>

    <div class="barplot">
        <div class="image_container">
            <img src="assets/images/intro.png">
            <div class="caption" style="margin-top: 0.0vw">
                <p>Figure 1: <strong>Overview of Mani-WM.</strong>Mani-WM is an interactive world model that allows users to input an
                    action trajectory to control the "real robot" in an initial frame.</p>
            </div>
        </div>
    </div>

    <h1 style="text-align: center;">Trajectory-conditioned Video Generation</h1>

    <p>
        Mani-WM is a novel method that generates extremely realistic videos of a robot that executes an action trajectory, starting from a given initial frame.
        We refer to this task as the <em>trajectory-to-video</em> task.
        The trajectory-to-video task differs from the general text-to-video task. 
        While various videos can meet the text condition in the text-to-video task, the predicted video in our trajectory-to-video task must strictly and accurately follow the input trajectory.
        More importantly, a challenge of this task is that each action in the trajectory provides an exact description of the robot's movement in each frame.
        This contrasts with the text-to-video task, where textual descriptions offer a general condition without specific frame-by-frame details.
        Another challenge is that the trajectory-to-video task features rich robot-object interactions, which must adhere to physical laws.
        Mani-WM leverages an innovative frame-level condition method to achieve precise frame-by-frame alignment between actions and video frames. 
        We use the powerful Diffusion Transformer as the backbone of Mani-WM to improve the modeling of robot-object interactions. 
        <strong>Mani-WM can generate realistic videos of high-resolution (up to 288 × 512) and long-horizon (up to 150+ frames).</strong>

    </p>

    <div class="barplot">
        <div class="image_container">
            <img src="assets/images/network.png">
            <div class="caption" style="margin-top: 0.0vw">
                <p>
                    Figure 2: <strong>Network Architecture of Mani-WM.</strong> (a) shows the general diffusion transformer architecture of Mani-WM. The input to Mani-WM includes the initial frame and the given trajectory. (b) Frame-level adaptation (Frame-Ada). (c) Video-level adaptation (Video-Ada).
                </p>
            </div>
        </div>
    </div>

    <h1 style="text-align: center;">Short Trajectory Prediction</h1>
    <p>
        <strong>Uncurated</strong> qualitative results of short trajectories are shown below.
        Click the <em>Click to View More</em> button to display another random subset from 100 unpicked samples for each dataset.
        All samples are from the test set.
        Each video contains 16 frames with 4 fps. 
        The video on the left is generated by Mani-WM, while the video on the right is the ground truth.
    </p>
    </br>
    <div class="button-container">
        <div class="button_click" style="padding: 10px 20px; margin: -30px 20px 20px 20px;" onclick="sample_short_videos()">Click to View More</div>
    </div>
    <div id="uncurated_short_video_grid"></div>
    <br>


    <h1 style="text-align: center;">Long Trajectory Prediction</h1>
    <p>
        <strong>Uncurated</strong> qualitative results of long trajectories are shown below.
        Click the <em>Click to View More</em> button to display another random subset from 100 unpicked episodes for each dataset.
        Click the <em>Click to View Very Long Videos</em> button to display the six longest videos from the 100 unpicked episodes.
        Hover over on these longest videos to see their number of frames.
        All episodes are from the test set. 
        The average number of frames of the 100 unpicked episodes are 47.04, 36.43, and 24.57 for RT-1, Bridge, and Language-Table, respectively.
        The video on the left is generated by Mani-WM; the video on the right is the ground truth. 
        Mani-WM retains the powerful capability of generating visually realistic and accurate videos of long-horizon as in the short trajectory setting.  
    </p>
    </br>
    <div class="button-container">
        <div class="button_click" style="padding: 10px 20px; margin: -30px 20px 20px 20px;" onclick="sample_long_videos()">Click to View More</div>
        <div class="button_click_long" style="padding: 10px 20px; margin: -30px 20px 20px 20px;" onclick="sample_longest_videos()">Click to View Very Long Videos</div>
    </div>
    <div id="uncurated_long_video_grid"></div>
    <br>

    <h1 style="text-align: center;">Scaling</h1>
    <p>
        We follow DiT and train Mani-WM-Frame-Ada of different model sizes ranging from 33M to 679M. 
        Results are shown in Fig. 4. 
        On all three datasets, Mani-WM scales elegantly with the increase of model sizes and training steps. 
        This indicates strong potential for increasing model sizes and training steps to further improve the performance.
    </p>
    <div class="barplot">
        <div class="image_container">
            <img src="assets/images/scale.png">
            <div class="caption" style="margin-top: 0.0vw; margin-bottom: 0vw; text-align: center;">
                <p>
                    Figure 4: <strong>Scaling.</strong> Mani-WM scales elegantly with the increase of model sizes and training steps
                </p>
            </div>
        </div>
    </div>

    <h1 style="text-align: center;">
        Flexible Action Controllability
    </h1>

    <p>
        To showcase the flexible action controllability of Mani-WM, we conduct qualitative experiments in which the virtual robot is guided by trajectories generated from three distinct input sources: a keyboard, a VR controller, and a policy. Importantly, these trajectories exhibit distributions that differ from those in the original dataset.
        For Language-Table with a 2D translation action space, we use the arrow keys from the keyboard to input action trajectories.
        For RT-1 and Bridge with a 3D action space, we use a VR controller to collect action trajectories as input.
        We also train Mani-WM on our own robot dataset and leverage a well-trained policy with action chunk techniques to predict the trajectories. 
        We compare the video generated by Mani-WM with the corresponding real-robot rollout.
        Videos below show that Mani-WM can accurately follow trajectories from different input sources, beyond the training domain.
        Additionally, Mani-WM is able to robustly handle multimodality in generation,i.e., generating corresponding videos with an identical initial frame but different trajectories. 
        In the Appendix in the paper, we also demonstrate that Mani-WM can handle noisy and physically implausible trajectories.
    </p>

    <h2 style="text-align: center; margin: 20px 20px 20px 20px; font-weight: normal; font-size: 1.2vw">"Controlling" the Robot in Language-Table with a Keyboard</h2>

    <div id="app_3D_video_grid">
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/application/app_languagetable.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Language-Table <br> Prediction <br>16 frames </div>
                <div class="overlay1 right">Language-Table <br> Prediction <br>16 frames </div>
            </div>
        </div>
    </div>


    <h2 style="text-align: center; margin: 20px 20px 20px 20px; font-weight: normal; font-size: 1.2vw">"Controlling" the Robot in RT-1 with a VR Controller</h2>

    <div id="app_3D_video_grid">
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/application/app_rt1.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">RT-1 <br> Prediction <br> 47 frames </div>
                <div class="overlay1 right">RT-1 <br> Ground-truth <br> 47 frames </div>
            </div>
        </div>
    </div>

    <h2 style="text-align: center; margin: 20px 20px 20px 20px; font-weight: normal; font-size: 1.2vw">"Controlling" the Robot in Bridge with a VR Controller</h2>

    <div id="app_3D_video_grid">
        <div class="video_wrapper">
            <div class="video_container">
                <video autoplay muted playsinline loop>
                    <source src="assets/videos/application/app_bridge.mp4" type="video/mp4">
                </video>
                <div class="overlay1 left">Bridge <br> Prediction <br> 17 frames</div>
                <div class="overlay1 right">Bridge <br> Ground-truth <br> 17 frames</div>
            </div>
        </div>
    </div>

    <h1 style="text-align: center;margin-bottom: 10px;">Real-Robot Model-based Planning Experiment</h1>
    <p>
        We conduct a real-robot model-based planning experiment to show the usefulness of Mani-WM for manipulation task. The experiment demonstrates that Mani-WM can effectively plan trajectories to finish manipulation tasks by predicting the outcomes of executing different candidate trajectories.
    </p>

    <p><b>Video results:</b> The left column includes the Initial Image and the Goal Image, while each column on the right shows 
        the real execution video at the top and the predicted video at the bottom. The trajectory is selected from the sampled candidate 
        trajectories based on the similarity between the predicted video and the goal image. Our videos include both successful and failed examples for each method.</p>

    <h2 style="text-align: center;margin-bottom: 10px;">Task: Close Drawer</h1>

    <div class="container">
        <!-- First column with Start and Goal Image -->
        <div class="start-goal">
            <div class="item">
                <h3 style="margin-top: 26px;">Initial Image</h3>
                <img src="assets/videos/planning/close_drawer/start_image.png" alt="Start Image">
            </div>
            <div class="item">
                <img src="assets/videos/planning/close_drawer/target_image.png" alt="Goal Image">
                <h3>Goal Image</h3>
                <br>
            </div>
        </div>

        <!-- Other columns with videos -->
        <div class="item">
            <h3 style="font-size: 18px; white-space: nowrap;">Mani-WM (ResNet) Success</h3>
            <h3> Execute Video </h3>
            <video autoplay muted loop>
                <source src="assets/videos/planning/close_drawer/resnet/43_success.mp4" type="video/mp4">
                43_success.mp4
            </video>
            <h3> Predict Video </h3>
        </div>
        <div class="item">
            <h3>Mani-WM (ResNet) Fail</h3>
            <h3> Execute Video </h3>
            <video autoplay muted loop>
                <source src="assets/videos/planning/close_drawer/resnet/48_fail.mp4" type="video/mp4">
                48_fail.mp4
            </video>
            <h3> Predict Video </h3>
        </div>
        <div class="item">
            <h3>Mani-WM (MSE) Success</h3>
            <h3> Execute Video </h3>
            <video autoplay muted loop>
                <source src="assets/videos/planning/close_drawer/mse/45_success.mp4" type="video/mp4">
                45_success.mp4
            </video>
            <h3> Predict Video </h3>
        </div>
        <div class="item">
            <h3>Mani-WM (MSE) Fail</h3>
            <h3> Execute Video </h3>
            <video autoplay muted loop>
                <source src="assets/videos/planning/close_drawer/mse/30_fail.mp4" type="video/mp4">
                30_fail.mp4
            </video>
            <h3> Predict Video </h3>
        </div>
        <div class="item">
            <h3>Random Success</h3>
            <h3> Execute Video </h3>
            <video autoplay muted loop>
                <source src="assets/videos/planning/close_drawer/random/43_success.mp4" type="video/mp4">
                43_success.mp4
            </video>
            <h3> Predict Video </h3>
        </div>
        <div class="item">
            <h3>Random Fail</h3>
            <h3> Execute Video </h3>
            <video autoplay muted loop>
                <source src="assets/videos/planning/close_drawer/random/30_fail.mp4" type="video/mp4">
                30_fail.mp4
            </video>
            <h3> Predict Video </h3>
        </div>
    </div>

    <h2 style="text-align: center;margin-bottom: 10px;">Task: Place Mandarin on Green Plate</h1>

        <div class="container">
            <!-- First column with Start and Goal Image -->
            <div class="start-goal">
                <div class="item">
                    <h3 style="margin-top: 26px;">Initial Image</h3>
                    <img src="assets/videos/planning/place_green/start_image.png" alt="Start Image">
                </div>
                <div class="item">
                    <img src="assets/videos/planning/place_green/target_image.png" alt="Goal Image">
                    <h3>Goal Image</h3>
                </div>
            </div>
    
            <!-- Other columns with videos -->
            <div class="item">
                <h3 style="font-size: 18px; white-space: nowrap;">Mani-WM (ResNet) Success</h3>
                <h3> Execute Video </h3>
            <video autoplay muted loop>
                    <source src="assets/videos/planning/place_green/resnet/39_success.mp4" type="video/mp4">
                    43_success.mp4
                </video>
                <h3> Predict Video </h3>
            </div>
            <div class="item">
                <h3>Mani-WM (ResNet) Fail</h3>
                <h3> Execute Video </h3>
                <video autoplay muted loop>
                    <source src="assets/videos/planning/place_green/resnet/46_fail.mp4" type="video/mp4">
                    48_fail.mp4
                </video>
                <h3> Predict Video </h3>
            </div>
            <div class="item">
                <h3>Mani-WM (MSE) Success</h3>
                <h3> Execute Video </h3>
            <video autoplay muted loop>
                    <source src="assets/videos/planning/place_green/mse/40_success.mp4" type="video/mp4">
                    45_success.mp4
                </video>
                <h3> Predict Video </h3>
            </div>
            <div class="item">
                <h3>Mani-WM (MSE) Fail</h3>
                <h3> Execute Video </h3>
            <video autoplay muted loop>
                    <source src="assets/videos/planning/place_green/mse/44_fail.mp4" type="video/mp4">
                    30_fail.mp4
                </video>
                <h3> Predict Video </h3>
            </div>
            <div class="item">
                <h3>Random Success</h3>
                <h3> Execute Video </h3>
            <video autoplay muted loop>
                    <source src="assets/videos/planning/place_green/random/42_success.mp4" type="video/mp4">
                    43_success.mp4
                </video>
                <h3> Predict Video </h3>
            </div>
            <div class="item">
                <h3>Random Fail</h3>
                <h3> Execute Video </h3>
            <video autoplay muted loop>
                    <source src="assets/videos/planning/place_green/random/16_fail.mp4" type="video/mp4">
                    30_fail.mp4
                </video>
                <h3> Predict Video </h3>
            </div>
        </div>

        <h2 style="text-align: center;margin-bottom: 10px;">Task: Place Mandarin on Red Plate</h1>

            <div class="container">
                <!-- First column with Start and Goal Image -->
                <div class="start-goal">
                    <div class="item">
                        <h3 style="margin-top: 26px;">Initial Image</h3>
                        <img src="assets/videos/planning/place_red/start_image.png" alt="Start Image">
                    </div>
                    <div class="item">
                        <img src="assets/videos/planning/place_red/target_image.png" alt="Goal Image">
                        <h3>Goal Image</h3>
                    </div>
                </div>
        
                <!-- Other columns with videos -->
                <div class="item">
                    <h3 style="font-size: 18px; white-space: nowrap;">Mani-WM (ResNet) Success</h3>
                    <h3> Execute Video </h3>
            <video autoplay muted loop>
                        <source src="assets/videos/planning/place_red/resnet/30_success.mp4" type="video/mp4">
                        43_success.mp4
                    </video>
                    <h3> Predict Video </h3>
                </div>
                <div class="item">
                    <h3>Mani-WM (ResNet) Fail</h3>
                    <h3> Execute Video </h3>
            <video autoplay muted loop>
                        <source src="assets/videos/planning/place_red/resnet/44_fail.mp4" type="video/mp4">
                        48_fail.mp4
                    </video>
                    <h3> Predict Video </h3>
                </div>
                <div class="item">
                    <h3>Mani-WM (MSE) Success</h3>
                    <h3> Execute Video </h3>
            <video autoplay muted loop>
                        <source src="assets/videos/planning/place_red/mse/32_success.mp4" type="video/mp4">
                        45_success.mp4
                    </video>
                    <h3> Predict Video </h3>
                </div>
                <div class="item">
                    <h3>Mani-WM (MSE) Fail</h3>
                    <h3> Execute Video </h3>
            <video autoplay muted loop>
                        <source src="assets/videos/planning/place_red/mse/21_fail.mp4" type="video/mp4">
                        30_fail.mp4
                    </video>
                    <h3> Predict Video </h3>
                </div>
                <div class="item">
                    <h3>Random Success</h3>
                    <h3> Execute Video </h3>
            <video autoplay muted loop>
                        <source src="assets/videos/planning/place_red/random/31_success.mp4" type="video/mp4">
                        43_success.mp4
                    </video>
                    <h3> Predict Video </h3>
                </div>
                <div class="item">
                    <h3>Random Fail</h3>
                    <h3> Execute Video </h3>
            <video autoplay muted loop>
                        <source src="assets/videos/planning/place_red/random/2_fail.mp4" type="video/mp4">
                        30_fail.mp4
                    </video>
                    <h3> Predict Video </h3>
                </div>
            </div>

    
</div>
<script src="sampleVideos.js"></script>
<script>
    document.addEventListener('DOMContentLoaded', (event) => {
        sample_short_videos(); 
    });
    document.addEventListener('DOMContentLoaded', (event) => {
        sample_long_videos(); 
    });
</script>
<script type="text/javascript">
    /* https://stackoverflow.com/questions/3027707/how-to-change-the-playing-speed-of-videos-in-html5 */
    document.querySelector('video').defaultPlaybackRate = 1.0;
    document.querySelector('video').play();

    var videos =document.querySelectorAll('video');
    for (var i=0;i<1;i++)
    {
        videos[i].playbackRate = 1.0;
    }
</script>
<script>
    /* https://stackoverflow.com/questions/21163756/html5-and-javascript-to-play-videos-only-when-visible */
    var videos = document.getElementsByTagName("video");

    function checkScroll() {
        var fraction = 0.5; // Play when 70% of the player is visible.

        for(var i = 0; i < 1; i++) {  // only apply to the first video

            var video = videos[i];

            var x = video.offsetLeft, y = video.offsetTop, w = video.offsetWidth, h = video.offsetHeight, r = x + w, //right
                b = y + h, //bottom
                visibleX, visibleY, visible;

            visibleX = Math.max(0, Math.min(w, window.pageXOffset + window.innerWidth - x, r - window.pageXOffset));
            visibleY = Math.max(0, Math.min(h, window.pageYOffset + window.innerHeight - y, b - window.pageYOffset));

            visible = visibleX * visibleY / (w * h);

            if (visible > fraction) {
                video.play();
            } else {
                video.pause();
            }

        }

    }
    window.addEventListener('scroll', checkScroll, false);
    window.addEventListener('resize', checkScroll, false);
</script>
<script>
    // Function to check if the user is on a mobile device
    function isMobileDevice() {
        return /Mobi|Android/i.test(navigator.userAgent);
    }

    // If the user is on a mobile device, disable autoplay
    if (isMobileDevice()) {
        const videos = document.querySelectorAll('video');
        videos.forEach(video => {
            video.autoplay = false;
        });
    }
</script>
<script src="https://d3e54v103j8qbb.cloudfront.net/js/jquery-3.5.1.min.dc5e7f18c8.js?site=51e0d73d83d06baa7a00000f"
        type="text/javascript" integrity="sha256-9/aliU8dGd2tb6OSsuzixeV4y/faTqgFtohetphbbj0="
        crossorigin="anonymous"></script>
<script src="https://uploads-ssl.webflow.com/51e0d73d83d06baa7a00000f/js/webflow.fd002feec.js"
        type="text/javascript"></script>
</body>
</html>
