Learning to Act from Actionless Video
through Dense Correspondences


Anonymous Authors

In this work, we present an approach to constructing a video-based robot policy capable of executing diverse tasks across different robots and environments without requiring any action annotations. Our method leverages images as a task-agnostic representation, encoding both state and action information. By synthesizing videos that "hallucinate" a robot executing actions and combining them with dense correspondences between frames, our approach can infer the closed-form actions to execute in an environment without any explicit action labels. This unique capability allows us to train the policy solely from RGB videos and deploy the learned policies to various robotic tasks. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models on just 4 GPUs within a single day.





Framework Overview
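
To make the action-inference step described in the abstract concrete, below is a minimal sketch (not our released implementation) of how dense correspondences between two synthesized frames could be converted into a rigid-body action. It uses Farneback optical flow as a stand-in for a learned correspondence model and assumes access to a depth map, camera intrinsics K, and a mask over the gripper/object of interest; all helper names are illustrative.

import cv2
import numpy as np

def backproject(pixels, depth, K):
    """Lift pixel coordinates (N, 2) to 3D points (N, 3) in the camera frame."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth[pixels[:, 1], pixels[:, 0]]
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def rigid_transform(src, dst):
    """Kabsch solve for rotation R and translation t such that dst ~ R @ src + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    return R, dst.mean(0) - R @ src.mean(0)

def action_from_frames(frame0, frame1, depth0, K, mask):
    """Infer the relative SE(3) motion between two consecutive video-plan frames."""
    g0 = cv2.cvtColor(frame0, cv2.COLOR_RGB2GRAY)
    g1 = cv2.cvtColor(frame1, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    ys, xs = np.nonzero(mask)
    src_px = np.stack([xs, ys], axis=1)                   # (N, 2) pixel coordinates
    dst_px = np.round(src_px + flow[ys, xs]).astype(int)  # dense correspondences
    h, w = depth0.shape
    dst_px[:, 0] = np.clip(dst_px[:, 0], 0, w - 1)
    dst_px[:, 1] = np.clip(dst_px[:, 1], 0, h - 1)
    src_3d = backproject(src_px, depth0, K)
    dst_3d = backproject(dst_px, depth0, K)               # crude: reuses the first depth map
    return rigid_transform(src_3d, dst_3d)

In this sketch, the recovered (R, t) for each pair of consecutive frames in the synthesized plan would be issued as a relative motion command, so no action labels are needed at any point.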



Extended Qualitative Results


Meta-World


Meta-World (Yu et al., 2019) is a simulated benchmark featuring various manipulation tasks with a Sawyer robot arm. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.

Synthesized Videos


Robot Executions

Task: Assembly
Task: Door Open
Task: Hammer
Task: Shelf Place


iTHOR


iTHOR (Kolve et al., 2017) is a simulated benchmark for embodied common-sense reasoning. We consider object navigation tasks for evaluation, where an agent randomly initialized in a scene learns to navigate to an object of a given type (e.g., toaster, television). We present the video plans synthesized by our video diffusion model as well as robot navigation videos as follows.

Synthesized Videos


Robot Navigation

Task: Pillow
Task: Soap Bar
Task: Television
Task: Toaster


Cross-Embodiment Learning (Visual Pusher)


We aim to examine whether our method can achieve cross-embodiment learning, e.g., leveraging human demonstration videos to control robots to solve tasks. To this end, we learn a video diffusion model from only actionless human pushing videos from Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) and then evaluate our method on simulated robot pushing tasks without any fine-tuning. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.

Failed Executions


input image
video plan
execution


input image
video plan
execution


Successful Executions


input image
video plan
execution


input image
video plan
execution



Real-World Franka Emika Panda Arm with Bridge Dataset


We aim to examine whether our method can tackle real-world robotics tasks. To this end, we train our video generation model on the Bridge dataset (Ebert et al., 2022) and perform evaluation on a real-world Franka Emika Panda tabletop manipulation environment. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.

Synthesized Videos


Robot Executions

Task: put apple in plate
Task: put banana in plate
Task: put peach in blue bowl


Zero-Shot Generalization on Real-World Scene with Bridge Model


While most tasks in the Bridge dataset were recorded in toy kitchens, we found that the video diffusion model trained on this dataset can already generalize to complex real-world kitchen scenes, producing reasonable videos given RGB images and textual task descriptions. We present some examples of the synthesized videos below. Note that the videos are blurry since the original video resolution is low (48x64).

Task: pick up banana
generated video
Task: put lid on pot
generated video


Task: put pot in sink
generated video


Comparison of First-Frame Conditioning Strategy and
Different Text Encoders


We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when training on the Bridge dataset. A schematic comparison of the two strategies is sketched below, followed by qualitative examples of videos synthesized after 40k training steps.
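
The sketch below illustrates one way the two strategies could differ in tensor terms, assuming cat_c denotes concatenating the clean first frame channel-wise with every noisy frame and cat_t denotes stacking it along the temporal axis; the (B, C, T, H, W) layout and function names are assumptions rather than our released code.

import torch

def condition_cat_c(x_noisy, first_frame):
    """cat_c: tile the clean first frame over time and concatenate it channel-wise
    with every noisy frame, so each frame is directly conditioned on the observation."""
    B, C, T, H, W = x_noisy.shape                           # noisy video frames/latents
    cond = first_frame.unsqueeze(2).expand(B, C, T, H, W)
    return torch.cat([x_noisy, cond], dim=1)                # (B, 2C, T, H, W)

def condition_cat_t(x_noisy, first_frame):
    """cat_t: prepend the clean first frame as an extra frame along the temporal
    axis, so later frames receive the conditioning only indirectly."""
    return torch.cat([first_frame.unsqueeze(2), x_noisy], dim=2)  # (B, C, T+1, H, W)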


Improving Inference Efficiency with
Denoising Diffusion Implicit Models


This section investigates the possibility of accelerating the sampling process using Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). To this end, instead of iteratively denoising for 100 steps, as reported in the main paper, we experimented with different numbers of denoising steps (e.g., 25, 10, 5, 3) using DDIM. We found that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps) using DDIM, allowing us to tackle tasks with tight runtime constraints. A minimal DDIM sampling sketch is provided below, followed by the videos synthesized with 25, 10, 5, and 3 denoising steps.
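
The following is a minimal deterministic DDIM (eta = 0) sampling loop with a configurable number of steps; the model interface, conditioning keywords, and variable names are assumptions rather than the exact interface of our released framework.

import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=10, device="cuda", **cond):
    """Sample a video with `num_steps` DDIM steps from a noise-prediction model."""
    T = alphas_cumprod.shape[0]                              # e.g., 1000 training steps
    timesteps = torch.linspace(T - 1, 0, num_steps).long()   # strided subset of the schedule
    x = torch.randn(shape, device=device)                    # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), int(t), device=device)
        eps = model(x, t_batch, **cond)                      # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean video
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic DDIM update
    return x

# e.g., video = ddim_sample(model, (1, 3, 8, 48, 64), alphas_cumprod, num_steps=10,
#                           first_frame=obs, text_emb=task_emb)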

DDIM 25 steps: The quality of the synthesized videos is satisfactory despite minor temporal inconsistencies (e.g., the gripper or objects occasionally disappearing or being duplicated) compared to the DDPM (100 steps) videos reported in the previous sections.


DDIM 10 steps: The quality of the synthesized videos is similar to that of the videos generated with 25 steps.


DDIM 5 steps: The temporal inconsistency issue is more severe with only 5 denoising steps.


DDIM 3 steps: The temporal inconsistency issue becomes even more severe, and some objects appear blurry.