Keywords: Foundation Models, Hierarchical Planning
TL;DR: Leveraging multiple expert foundation models, trained individually on language, vision, and action data, to jointly solve long-horizon tasks
Abstract: To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose *Compositional Foundation Models for Hierarchical Planning* (HiP), a foundation model that leverages multiple *expert* foundation models, trained *individually* on language, vision, and action data, jointly to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. The generated video plans are then grounded in visual-motor control through an inverse dynamics model that infers actions from the generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via *iterative refinement*. We illustrate the efficacy and adaptability of our approach on three long-horizon table-top manipulation tasks.
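The three-level composition described in the abstract can be sketched as follows. This is a minimal, hypothetical outline, not the paper's implementation: the stub functions `llm_subgoals`, `video_plan`, and `inverse_dynamics` stand in for the language, video diffusion, and inverse dynamics foundation models, and the refinement loop is a placeholder for the paper's iterative-refinement consistency mechanism.

```python
# Hypothetical sketch of the HiP pipeline: LLM -> video diffusion ->
# inverse dynamics, with a placeholder iterative-refinement loop.
# All three "models" below are stubs standing in for real foundation models.

def llm_subgoals(task):
    # Stand-in for the LLM: decompose a long-horizon task into subgoals.
    return [f"{task}: subgoal {i}" for i in range(1, 4)]

def video_plan(subgoal, obs, feedback=None):
    # Stand-in for the video diffusion model: produce a short video plan
    # (a list of frames) conditioned on the subgoal and current observation.
    return [f"frame({subgoal}, t={t})" for t in range(3)]

def inverse_dynamics(frames):
    # Stand-in for the inverse dynamics model: infer one action per
    # adjacent pair of frames in the generated video plan.
    return [f"action({a} -> {b})" for a, b in zip(frames, frames[1:])]

def hip(task, obs, refine_steps=2):
    """Compose the three expert models hierarchically."""
    actions = []
    for subgoal in llm_subgoals(task):
        frames = video_plan(subgoal, obs)
        for _ in range(refine_steps):
            # Placeholder for iterative refinement: in HiP, feedback across
            # levels enforces consistency; here we simply regenerate.
            frames = video_plan(subgoal, obs, feedback=frames)
        actions.extend(inverse_dynamics(frames))
        obs = frames[-1]  # assume the plan segment was executed
    return actions
```

For example, `hip("stack blocks", "initial observation")` yields two actions per subgoal (three frames give two frame transitions) across three subgoals.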
Submission Number: 56