Keywords: Hierarchical Imitation Learning, Image Generation, Video Prediction, Robot Learning
TL;DR: A method that improves hierarchical imitation learning systems built on pre-trained image or video generative models.
Abstract: Image and video generative models that are pre-trained on Internet-scale data can increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be bottlenecked by the interface between the generative models and the low-level controllers: generative models may predict photorealistic yet physically infeasible frames, and low-level policies may be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. GHIL-Glue achieves a new state of the art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies on 3 of 4 language-conditioned manipulation tasks testing zero-shot generalization on a physical robot.
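The abstract describes a generate-filter-execute loop: a generative model proposes candidate subgoal images, a filter scores them for task progress, and a goal-conditioned policy acts toward the selected subgoal. The sketch below is a minimal illustration of that loop, not the paper's actual implementation; the objects and function names (`generator.sample`, `filter_model.progress_score`, `policy.act`) are hypothetical placeholders for the assumed interfaces.

```python
# Minimal sketch of a generate-filter-execute control step for
# hierarchical imitation learning with a generative high-level planner.
# All object interfaces and function names are hypothetical placeholders.

def hierarchical_control_step(obs, instruction,
                              generator, filter_model, policy,
                              num_candidates=8):
    """Propose candidate subgoal images, keep the one judged most likely
    to make task progress, and pass it to the low-level policy."""
    # High-level: sample candidate subgoal images from a pre-trained
    # language-conditioned image/video generative model.
    candidates = [generator.sample(obs, instruction)
                  for _ in range(num_candidates)]

    # Filter: score each candidate for task progress, so physically
    # infeasible or off-task generations are discarded.
    scores = [filter_model.progress_score(obs, goal, instruction)
              for goal in candidates]
    best_subgoal = candidates[max(range(num_candidates),
                                  key=lambda i: scores[i])]

    # Low-level: a goal-conditioned policy outputs an action that moves
    # the robot toward the selected subgoal image.
    return policy.act(obs, goal_image=best_subgoal)
```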
Submission Number: 42