Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

Published: 05 Apr 2024, Last Modified: 01 May 2024 · VLMNM 2024 · CC BY 4.0
Keywords: Representation Learning for Control, Diffusion Models, Foundation Models
TL;DR: We investigate representations from pre-trained text-to-image diffusion models for control tasks and showcase competitive performance across a wide range of tasks.
Abstract: Learning general agents that can tackle any language-specified embodied task, as humans can, requires a broad understanding of the world through text and visual inputs. Such capabilities are difficult to learn solely from task-specific data, and as such, pre-trained vision-language models have emerged as powerful tools that enable the effective transfer of representations learned from internet-scale data to new domains and downstream tasks. However, many previously used representations are constrained by the upstream task the pre-trained model was designed for and may lack features for fine-grained spatial understanding of a scene, which are vital for control. To address this issue, we leverage pre-trained text-to-image diffusion models to construct Stable Control Representations, which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned with Stable Control Representations are competitive across an extensive range of simulated control benchmarks and exhibit strong performance on difficult control tasks that require generalization to unseen objects at test time. Most notably, we show that Stable Control Representations enable learning policies with state-of-the-art performance on a challenging open-vocabulary navigation benchmark.
Submission Number: 21
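As an illustrative sketch of how language-conditioned representations might be extracted from a pre-trained text-to-image diffusion model, the snippet below pulls an intermediate U-Net activation from Stable Diffusion via the Hugging Face diffusers library. The checkpoint ("runwayml/stable-diffusion-v1-5"), the hooked mid-block layer, the noise timestep, the spatial mean pooling, and the function name control_representation are all assumptions made for illustration, not the paper's exact recipe.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed backbone; the paper's exact checkpoint and layer choices may differ.
model_id = "runwayml/stable-diffusion-v1-5"
device = "cuda" if torch.cuda.is_available() else "cpu"

vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device).eval()
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device).eval()
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device).eval()
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Capture an intermediate U-Net activation with a forward hook (layer choice is illustrative).
features = {}
def save_feature(_module, _inputs, output):
    features["mid"] = output
unet.mid_block.register_forward_hook(save_feature)

@torch.no_grad()
def control_representation(image, instruction, timestep=100):
    """image: (B, 3, 512, 512) tensor scaled to [-1, 1]; instruction: task string."""
    # Encode the observation into the VAE latent space used by Stable Diffusion.
    latents = vae.encode(image.to(device)).latent_dist.mean * vae.config.scaling_factor
    # Lightly noise the latents so the denoising U-Net sees an in-distribution input.
    t = torch.full((latents.shape[0],), timestep, device=device, dtype=torch.long)
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), t)
    # Condition on the language instruction through the CLIP text encoder.
    tokens = tokenizer(instruction, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids.to(device))[0]
    text_emb = text_emb.expand(latents.shape[0], -1, -1)
    # One denoising forward pass; we keep the hooked activation, not the noise prediction.
    unet(noisy, t, encoder_hidden_states=text_emb)
    # Spatially pool the hooked feature map into a vector for a downstream policy head.
    return features["mid"].mean(dim=(2, 3))
```

A downstream policy head could then be trained on the pooled vector (or on the unpooled feature map when finer spatial detail is needed) with standard imitation or reinforcement learning objectives.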