Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models

Kevin Black; Mitsuhiko Nakamoto; Pranav Atreya; Homer Walke; Chelsea Finn; Aviral Kumar; Sergey Levine

Published: 03 Nov 2023, Last Modified: 27 Nov 2023, GCRL Workshop

Keywords: robot learning, diffusion model

Abstract: If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller attains. Specifically, we fine-tune InstructPix2Pix on robot data such that it outputs a hypothetical future observation given the robot's current observation and a language command. We then use the same robot data to train a low-level goal-conditioned policy to reach a given image observation. We find that when these components are combined, the resulting system exhibits robust generalization capabilities. The high-level planner uses its Internet-scale pre-training and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization than conventional language-conditioned policies. We demonstrate that this approach solves robot control tasks involving novel objects, distractors, and even novel environments, both in the real world and in simulation.
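
The two-level structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name run_subgoal_loop, its parameters, the fixed number of low-level steps per subgoal, and the one-dimensional toy stand-ins are all assumptions made for illustration. In the real system the planner would be the fine-tuned image-editing model and observations would be images.

```python
from typing import Callable, List, Tuple, TypeVar

Obs = TypeVar("Obs")
Action = TypeVar("Action")


def run_subgoal_loop(
    obs: Obs,
    instruction: str,
    propose_subgoal: Callable[[Obs, str], Obs],  # hypothetical high-level planner
    policy: Callable[[Obs, Obs], Action],        # hypothetical goal-conditioned policy
    step: Callable[[Obs, Action], Obs],          # hypothetical environment transition
    num_subgoals: int,
    steps_per_subgoal: int,                      # assumption: fixed budget per subgoal
) -> Tuple[Obs, List[Obs]]:
    """Alternate between proposing a subgoal and acting toward it."""
    subgoals: List[Obs] = []
    for _ in range(num_subgoals):
        # High level: imagine a future observation from the current one
        # and the language command.
        goal = propose_subgoal(obs, instruction)
        subgoals.append(goal)
        # Low level: act toward the proposed subgoal for a fixed number of steps.
        for _ in range(steps_per_subgoal):
            action = policy(obs, goal)
            obs = step(obs, action)
    return obs, subgoals


# Toy stand-ins (illustrative only): the "observation" is an integer position.
def toy_propose(obs: int, instruction: str) -> int:
    return obs + 3  # pretend the planner imagines a state three units ahead


def toy_policy(obs: int, goal: int) -> int:
    return (goal > obs) - (goal < obs)  # move one unit toward the goal


def toy_step(obs: int, action: int) -> int:
    return obs + action


if __name__ == "__main__":
    final_obs, goals = run_subgoal_loop(
        0, "move right", toy_propose, toy_policy, toy_step,
        num_subgoals=2, steps_per_subgoal=3,
    )
    print(final_obs, goals)
```

The low-level policy in this sketch never sees the instruction, only the proposed subgoal, which mirrors the division of labor the abstract describes between the planner and the goal-conditioned controller.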


Submission Number: 29
