Text-Aware Diffusion Policies

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: diffusion models, reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Pretrained text-to-image diffusion models enable the learning of reinforcement learning agents aligned with text conditioning.
Abstract: Diffusion models scaled to massive datasets have demonstrated a powerful ability to unify the language modality and pixel space, as convincingly evidenced by high-quality text-to-image synthesis that delights and astounds. In this work, we interpret agents interacting within a visual reinforcement learning setting as trainable video renderers, where the output video is simply the frames stitched together across sequential timesteps. We then propose Text-Aware Diffusion Policies (TADPols), which use large-scale pretrained models, particularly text-to-image diffusion models, to train policies that are aligned with natural language inputs. Because the behavior represented within a policy naturally aligns with the reward function used during optimization, we propose generating the reward signal for a reinforcement learning agent as the similarity between a provided text description and the frames the agent produces through its interactions. Furthermore, rendering the video produced by an agent at inference time can be treated as a form of text-to-video generation, with the added benefit that the video is always smooth and consistent with the environment's specifications. Additionally, keeping the diffusion model frozen enables an investigation of how well a large-scale model pretrained only on static image and text data understands temporally extended behaviors and actions. We conduct experiments on a variety of locomotion tasks across multiple subjects, and demonstrate that agents can be trained using the unified understanding of vision and language captured within large-scale pretrained diffusion models not only to synthesize videos that correspond with provided text, but also to perform the described motion themselves as autonomous agents.
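To make the reward idea in the abstract concrete, here is a minimal, hedged sketch of one plausible way to turn a frozen text-to-image diffusion model into a per-frame alignment reward. This is not the authors' released implementation: `denoiser`, `noise_scheduler`, and `text_embedding` are hypothetical stand-ins for a pretrained model's noise-prediction network, its forward-noising schedule, and an encoded prompt, and the actual TADPol reward may be computed differently.

```python
# Hedged sketch (assumed, not the paper's code): reward a rendered frame by how well a
# frozen, text-conditioned denoiser can remove noise added to that frame. A lower
# denoising error is read as higher text-image alignment, so the reward is its negative.
import torch
import torch.nn.functional as F


@torch.no_grad()
def diffusion_alignment_reward(frame, text_embedding, denoiser, noise_scheduler,
                               n_samples: int = 4) -> torch.Tensor:
    """Score one rendered frame against a text prompt.

    frame:          (C, H, W) tensor, preprocessed as the diffusion model expects.
    text_embedding: encoded prompt from the model's (frozen) text encoder.
    denoiser:       frozen noise-prediction network, called as denoiser(x_t, t, cond).
    noise_scheduler: object exposing num_timesteps and add_noise(x0, noise, t).
    """
    frame = frame.unsqueeze(0)  # add batch dimension -> (1, C, H, W)
    errors = []
    for _ in range(n_samples):
        # Sample a random diffusion timestep and noise the frame accordingly.
        t = torch.randint(0, noise_scheduler.num_timesteps, (1,), device=frame.device)
        noise = torch.randn_like(frame)
        noisy_frame = noise_scheduler.add_noise(frame, noise, t)
        # Ask the frozen, text-conditioned denoiser to predict the injected noise.
        predicted_noise = denoiser(noisy_frame, t, text_embedding)
        errors.append(F.mse_loss(predicted_noise, noise))
    # Negative mean denoising error: larger reward means the frame better matches the text.
    return -torch.stack(errors).mean()
```

In a standard visual RL loop, this scalar would simply replace (or supplement) the environment reward at each timestep, leaving the choice of policy-optimization algorithm unchanged.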
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5835