Mani-WM: An Interactive World Model for Real-Robot Manipulation

ICLR 2025 Submission #6028

Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose learning an interactive world model for robot manipulation as an alternative. We present Mani-WM, a novel method that leverages the power of generative models to generate realistic videos of a robot arm executing a given action trajectory, starting from a given initial frame. Mani-WM employs a novel frame-level conditioning technique to ensure precise alignment between actions and video frames and leverages a diffusion transformer for high-quality video generation. To validate the effectiveness of Mani-WM, we perform extensive experiments on four challenging real-robot datasets. Results show that Mani-WM outperforms all compared baseline methods and is preferred in human evaluations. We further showcase the flexible action controllability of Mani-WM by controlling the virtual robots in the datasets with trajectories 1) predicted by an autonomous policy and 2) collected with a keyboard or a VR controller. Finally, we combine Mani-WM with model-based planning to showcase its usefulness on real-robot manipulation tasks. We hope that Mani-WM can serve as an effective and scalable approach for enhancing robot learning in the real world.


Video Generation as a World Model

We build an interactive world model for real-robot manipulation that simulates robot trajectories accurately and in a way that is almost visually indistinguishable from the real world. With such a world model, agents can interactively control virtual robots to interact with diverse objects in various scenes, and perform model-based planning by imagining the outcomes of different candidate trajectories.

Figure 1: Overview of Mani-WM. Mani-WM is an interactive world model that allows users to input an action trajectory to control the "real robot" in an initial frame.

Trajectory-conditioned Video Generation

Mani-WM is a novel method that generates highly realistic videos of a robot executing an action trajectory, starting from a given initial frame. We refer to this task as the trajectory-to-video task. The trajectory-to-video task differs from the general text-to-video task: while many videos can satisfy a text condition, the video predicted in the trajectory-to-video task must strictly and accurately follow the input trajectory. More importantly, one challenge of this task is that each action in the trajectory provides an exact description of the robot's movement in the corresponding frame. This contrasts with the text-to-video task, where textual descriptions offer a general condition without frame-by-frame details. Another challenge is that the trajectory-to-video task features rich robot-object interactions, which must adhere to physical laws. Mani-WM leverages an innovative frame-level conditioning method to achieve precise frame-by-frame alignment between actions and video frames, and uses the powerful Diffusion Transformer as its backbone to improve the modeling of robot-object interactions. Mani-WM can generate realistic videos at high resolution (up to 288 × 512) and over long horizons (150+ frames).

Figure 2: Network Architecture of Mani-WM. (a) shows the general diffusion transformer architecture of Mani-WM. The input to Mani-WM includes the initial frame and the given trajectory. (b) Frame-level adaptation (Frame-Ada). (c) Video-level adaptation (Video-Ada).
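To make the frame-level conditioning concrete, here is a minimal PyTorch sketch of a Frame-Ada-style transformer block. It is our own illustration under assumed tensor shapes and names (e.g., FrameAdaLNBlock), not the released implementation: each frame's action is mapped to its own adaptive layer-norm scale, shift, and gate, so the tokens of frame t are modulated only by action a_t, whereas a Video-Ada variant would condition every frame on a single pooled trajectory embedding.

```python
import torch
import torch.nn as nn

class FrameAdaLNBlock(nn.Module):
    """Hypothetical sketch of frame-level adaptive layer-norm (Frame-Ada) conditioning.

    Each frame's action produces its own scale/shift/gate, so the tokens of frame t
    are modulated only by action a_t. A video-level variant (Video-Ada) would instead
    pool the whole trajectory into one conditioning vector shared by all frames.
    """

    def __init__(self, dim: int, action_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Per-frame action -> 6 modulation vectors (shift/scale/gate for attention and MLP).
        self.action_embed = nn.Sequential(
            nn.Linear(action_dim, dim), nn.SiLU(), nn.Linear(dim, 6 * dim)
        )

    def forward(self, x: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # x:       (B, T, N, D) video tokens, N spatial tokens per frame
        # actions: (B, T, A)    one action per frame
        B, T, N, D = x.shape
        shift1, scale1, gate1, shift2, scale2, gate2 = self.action_embed(actions).chunk(6, dim=-1)

        def modulate(h, shift, scale):
            # Broadcast each frame's modulation over that frame's spatial tokens.
            return h * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

        h = modulate(self.norm1(x), shift1, scale1).reshape(B, T * N, D)
        h, _ = self.attn(h, h, h)                        # joint spatio-temporal attention
        x = x + gate1.unsqueeze(2) * h.reshape(B, T, N, D)
        x = x + gate2.unsqueeze(2) * self.mlp(modulate(self.norm2(x), shift2, scale2))
        return x

if __name__ == "__main__":
    block = FrameAdaLNBlock(dim=64, action_dim=7)
    video_tokens = torch.randn(2, 16, 32, 64)     # 2 videos, 16 frames, 32 tokens/frame
    trajectory = torch.randn(2, 16, 7)             # one 7-DoF action per frame
    print(block(video_tokens, trajectory).shape)   # torch.Size([2, 16, 32, 64])
```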

Short Trajectory Prediction

Uncurated qualitative results on short trajectories are shown below. Click the Click to View More button to display another random subset of the 100 non-cherry-picked samples for each dataset. All samples are from the test set. Each video contains 16 frames at 4 fps. The video on the left is generated by Mani-WM, while the video on the right is the ground truth.



Long Trajectory Prediction

Uncurated qualitative results on long trajectories are shown below. Click the Click to View More button to display another random subset of the 100 non-cherry-picked episodes for each dataset. Click the Click to View Very Long Videos button to display the six longest videos among these 100 episodes. Hover over these longest videos to see their number of frames. All episodes are from the test set. The average numbers of frames of the 100 episodes are 47.04, 36.43, and 24.57 for RT-1, Bridge, and Language-Table, respectively. The video on the left is generated by Mani-WM; the video on the right is the ground truth. Mani-WM retains its capability of generating visually realistic and accurate videos over long horizons, as in the short-trajectory setting.



Scaling

We follow DiT and train Mani-WM-Frame-Ada at different model sizes ranging from 33M to 679M parameters. Results are shown in Fig. 4. On all three datasets, Mani-WM scales gracefully with increasing model size and training steps. This indicates strong potential for further improving performance by scaling up model size and training steps.

Figure 4: Scaling. Mani-WM scales gracefully with increasing model size and training steps.

Flexible Action Controllability

To showcase the flexible action controllability of Mani-WM, we conduct qualitative experiments in which the virtual robot is guided by trajectories from three distinct input sources: a keyboard, a VR controller, and a policy. Importantly, these trajectories exhibit distributions that differ from those in the original datasets. For Language-Table, with a 2D translation action space, we use the arrow keys of a keyboard to input action trajectories. For RT-1 and Bridge, with a 3D action space, we use a VR controller to collect action trajectories as input. We also train Mani-WM on our own robot dataset and leverage a well-trained policy with action-chunking techniques to predict the trajectories. We compare the videos generated by Mani-WM with the corresponding real-robot rollouts. The videos below show that Mani-WM can accurately follow trajectories from different input sources, beyond the training domain. Additionally, Mani-WM is able to robustly handle multimodality in generation, i.e., generating corresponding videos from an identical initial frame but different trajectories. In the Appendix of the paper, we also demonstrate that Mani-WM can handle noisy and physically implausible trajectories. A minimal sketch of how keyboard input can be turned into an action trajectory is shown below.
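The following is a hypothetical Python sketch of converting keyboard presses into a Language-Table-style 2D translation trajectory; the key names, step size, and function names are assumptions for illustration, not the exact interface we use.

```python
import numpy as np

# Hypothetical mapping from arrow keys to 2D translation deltas for a
# Language-Table-style action space; key names and step size are assumptions.
KEY_TO_DELTA = {
    "up":    np.array([0.0,  0.01]),
    "down":  np.array([0.0, -0.01]),
    "left":  np.array([-0.01, 0.0]),
    "right": np.array([0.01,  0.0]),
}

def keys_to_trajectory(key_presses, horizon=16):
    """Turn a sequence of key presses into a fixed-length action trajectory.

    Unknown keys map to zero motion, and the trajectory is padded with zero
    actions so its length matches the video horizon expected by the model.
    """
    actions = [KEY_TO_DELTA.get(k, np.zeros(2)) for k in key_presses[:horizon]]
    actions += [np.zeros(2)] * (horizon - len(actions))
    return np.stack(actions)  # shape: (horizon, 2)

if __name__ == "__main__":
    traj = keys_to_trajectory(["up", "up", "right", "right", "down"])
    print(traj.shape)  # (16, 2)
```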

"Controlling" the Robot in Language-Table with a Keyboard

Two Mani-WM predictions on Language-Table (16 frames each), generated from keyboard-input trajectories.

"Controlling" the Robot in RT-1 with a VR Controller

Mani-WM prediction and ground-truth rollout on RT-1 (47 frames each).

"Controlling" the Robot in Bridge with a VR Controller

Mani-WM prediction and ground-truth rollout on Bridge (17 frames each).

Real-Robot Model-based Planning Experiment

We conduct a real-robot model-based planning experiment to show the usefulness of Mani-WM for manipulation tasks. The experiment demonstrates that Mani-WM can effectively plan trajectories to complete manipulation tasks by predicting the outcomes of executing different candidate trajectories.

Video results: the left column shows the Initial Image and the Goal Image, while each column on the right shows the real execution video at the top and the predicted video at the bottom. The executed trajectory is selected from the sampled candidate trajectories based on the similarity between the predicted video and the goal image. The videos include both successful and failed examples for each method; a sketch of this planning loop is given below.
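For reference, the planning loop can be sketched as follows. This is a simplified, hypothetical illustration of sample-and-score planning with a world model; the `world_model` interface, the score function, and all hyperparameters are assumed placeholders rather than our exact implementation. The default score here is negative pixel-space MSE, and a learned feature similarity (e.g., ResNet embeddings) can be swapped in, which we take to correspond to the MSE and ResNet variants shown below.

```python
import numpy as np

def plan_with_world_model(world_model, initial_frame, goal_image,
                          num_candidates=64, horizon=16, action_dim=7,
                          score_fn=None, rng=None):
    """Hypothetical sketch of sample-and-score planning with a world model.

    `world_model(initial_frame, trajectory)` is an assumed interface that returns a
    predicted video as an array of shape (T, H, W, 3). Candidate trajectories are
    sampled at random, each one is "imagined" by the world model, the predicted final
    frame is scored against the goal image, and the best trajectory is returned for
    execution on the real robot.
    """
    rng = rng or np.random.default_rng(0)
    if score_fn is None:
        # Default: negative pixel-space MSE; a learned feature similarity
        # (e.g., ResNet embeddings) could be used instead.
        score_fn = lambda pred, goal: -float(np.mean((pred - goal) ** 2))

    best_traj, best_score = None, -np.inf
    for _ in range(num_candidates):
        candidate = rng.normal(scale=0.02, size=(horizon, action_dim))  # random action deltas
        predicted_video = world_model(initial_frame, candidate)
        score = score_fn(predicted_video[-1], goal_image)               # compare final frame to goal
        if score > best_score:
            best_traj, best_score = candidate, score
    return best_traj

if __name__ == "__main__":
    # Dummy world model for a dry run: it simply repeats the initial frame.
    dummy_wm = lambda frame, traj: np.repeat(frame[None], len(traj), axis=0)
    frame, goal = np.zeros((224, 224, 3)), np.ones((224, 224, 3))
    print(plan_with_world_model(dummy_wm, frame, goal, num_candidates=4).shape)  # (16, 7)
```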

Task: Close Drawer

Initial Image · Goal Image

Mani-WM (ResNet) Success: Execute Video · Predict Video
Mani-WM (ResNet) Fail: Execute Video · Predict Video
Mani-WM (MSE) Success: Execute Video · Predict Video
Mani-WM (MSE) Fail: Execute Video · Predict Video
Random Success: Execute Video · Predict Video
Random Fail: Execute Video · Predict Video

Task: Place Mandarin on Green Plate

Initial Image · Goal Image

Mani-WM (ResNet) Success: Execute Video · Predict Video
Mani-WM (ResNet) Fail: Execute Video · Predict Video
Mani-WM (MSE) Success: Execute Video · Predict Video
Mani-WM (MSE) Fail: Execute Video · Predict Video
Random Success: Execute Video · Predict Video
Random Fail: Execute Video · Predict Video

Task: Place Mandarin on Red Plate

Initial Image · Goal Image

Mani-WM (ResNet) Success: Execute Video · Predict Video
Mani-WM (ResNet) Fail: Execute Video · Predict Video
Mani-WM (MSE) Success: Execute Video · Predict Video
Mani-WM (MSE) Fail: Execute Video · Predict Video
Random Success: Execute Video · Predict Video
Random Fail: Execute Video · Predict Video