In this work, we present an approach to constructing a video-based robot policy capable of executing diverse tasks across different robots and environments without requiring any action annotations. Our method leverages images as a task-agnostic representation, encoding both state and action information. By synthesizing videos that "hallucinate" the robot executing actions and combining them with dense correspondences between frames, our approach can infer the closed-form actions to execute in an environment without any explicit action labels. This unique capability allows us to train policies solely from RGB videos and deploy them to a variety of robotic tasks. We demonstrate the efficacy of our approach in learning policies for table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with just 4 GPUs within a single day.
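To make the action-inference step concrete, the sketch below shows one way a per-step planar action could be recovered from two consecutive synthesized frames. It is a minimal illustration, not our implementation: it substitutes off-the-shelf Farneback optical flow (OpenCV) for the dense-correspondence model described in the paper, and the `robot_mask` and `meters_per_pixel` inputs are hypothetical stand-ins for the robot segmentation and camera calibration a real deployment would require.

```python
import cv2
import numpy as np

def infer_planar_action(frame_t, frame_t1, robot_mask, meters_per_pixel=1e-3):
    """Estimate a 2-D end-effector displacement from two consecutive
    synthesized frames via dense correspondences (illustrative sketch).

    frame_t, frame_t1: uint8 grayscale frames (H, W) from the generated video.
    robot_mask: boolean (H, W) mask selecting the robot/gripper pixels.
    meters_per_pixel: assumed (hypothetical) camera calibration constant.
    """
    # Farneback dense optical flow as a stand-in correspondence estimator;
    # returns per-pixel (dx, dy) displacements of shape (H, W, 2).
    flow = cv2.calcOpticalFlowFarneback(
        frame_t, frame_t1, None, 0.5, 3, 15, 3, 5, 1.2, 0
    )
    # Average the motion of the robot pixels and convert to a metric action.
    dx, dy = flow[robot_mask].mean(axis=0)
    return np.array([dx, dy]) * meters_per_pixel
```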
Meta-World (Yu et al., 2019) is a simulated benchmark featuring various manipulation tasks with a Sawyer robot arm. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
iTHOR (Kolve et al., 2017) is a simulated benchmark for embodied common-sense reasoning. We consider object navigation tasks for evaluation, where an agent randomly initialized in a scene learns to navigate to an object of a given type (e.g., toaster, television). We present the video plans synthesized by our video diffusion model as well as robot navigation videos as follows.
We aim to examine whether our method can achieve cross-embodiment learning, e.g., leveraging human demonstration videos to control robots to solve tasks. To this end, we learn a video diffusion model solely from actionless human pushing videos from Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) and then evaluate our method on simulated robot pushing tasks without any fine-tuning. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
We aim to examine whether our method can tackle real-world robotics tasks. To this end, we train our video generation model on the Bridge dataset (Ebert et al., 2022) and evaluate it on a real-world Franka Emika Panda tabletop manipulation environment. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
While most tasks in the Bridge dataset were recorded in toy kitchens, we found that the video diffusion model trained on this dataset can already generalize to complex real-world kitchen scenes, producing reasonable videos given RGB images and textual task descriptions. We present some examples of the synthesized videos below. Note that the videos are blurry because the original video resolution is low (48x64).
We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when trained on the Bridge dataset. Below we provide some qualitative examples of videos synthesized after 40k training steps.
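To clarify the difference between the two strategies, the snippet below sketches how the conditioning frame enters the model in each case. The (batch, time, channels, height, width) layout and the specific dimensions are illustrative assumptions rather than the exact configuration of our model.

```python
import torch

# Illustrative dimensions: T generated frames of size C x H x W each.
B, T, C, H, W = 1, 8, 3, 48, 64

noisy_video = torch.randn(B, T, C, H, W)   # frames currently being denoised
first_frame = torch.randn(B, 1, C, H, W)   # observed conditioning frame

# cat_c (ours): tile the first frame across time and concatenate along the
# channel axis, so every noisy frame is paired with the observation.
cond_c = torch.cat([noisy_video, first_frame.expand(B, T, C, H, W)], dim=2)
print(cond_c.shape)  # (B, T, 2*C, H, W) -- the model's input channels double

# cat_t (baseline): concatenate the conditioning frame along the time axis,
# so the model treats it as one extra (clean) frame of the sequence.
cond_t = torch.cat([first_frame, noisy_video], dim=1)
print(cond_t.shape)  # (B, T+1, C, H, W) -- the temporal length grows by one
```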
This section investigates the possibility of accelerating the sampling process using Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). To this end, instead of iteratively denoising for 100 steps as reported in the main paper, we experiment with different numbers of denoising steps (e.g., 25, 10, 5, and 3) using DDIM. We find that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps) with DDIM, making the approach viable for time-critical tasks. We present the videos synthesized with 25, 10, 5, and 3 denoising steps as follows.
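For reference, the following is a minimal sketch of deterministic DDIM sampling (eta = 0) with a reduced number of steps. The `model(x, t)` interface (predicting the added noise) and the schedule handling are assumptions made for illustration and do not reproduce our exact implementation.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=10):
    """Deterministic DDIM sampling (eta = 0) with a reduced number of steps.

    Assumes `model(x, t)` predicts the noise added at timestep t and that
    `alphas_cumprod` is a 1-D tensor holding the cumulative alpha schedule
    used during training (length = number of training timesteps).
    """
    T = len(alphas_cumprod)
    # Evenly spaced subset of the training timesteps, from T-1 down to 0.
    timesteps = torch.linspace(T - 1, 0, num_steps).long()

    x = torch.randn(shape)  # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)

        eps = model(x, torch.full((shape[0],), int(t)))      # predicted noise
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean sample
        # Deterministic DDIM update: move to the previous (coarser) timestep
        # without injecting fresh noise.
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x
```

With `num_steps=10`, this visits only a tenth of the training timesteps, which is the setting we found to retain high-fidelity video quality.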