Advantage-Conditioned Diffusion: Offline RL via Generalization

21 Sept 2023 (modified: 11 Feb 2024) | Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: offline RL, diffusion models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We show how one can train a performant policy with conditional diffusion models without maximizing a critic
Abstract: Reinforcement learning algorithms typically involve an explicit maximization step somewhere in the process. For example, policy gradient methods maximize an estimate of the expected return, and TD methods maximize over actions when computing the target value for the critic network. However, explicitly maximizing over the outputs of neural function approximators leads to selecting out-of-distribution actions during offline training, which in turn can cause value overestimation and distributional shift in the learned policy. Can we instead devise an offline RL method that maximizes the value implicitly, via generalization? In this paper, we show how expressive conditional generative models combined with implicit Q-learning backups enable this, yielding an offline RL method that attains good results through generalization alone, and state-of-the-art results when combined with a simple filtering step that maximizes over samples from the policy only at evaluation time. We believe our work provides evidence that the next big advances in offline RL will involve powerful generative models.
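
The abstract describes an evaluation-time "sample-and-filter" step: draw several candidate actions from an advantage-conditioned generative policy and keep the one the critic scores highest, so maximization happens only over the policy's own samples. Below is a minimal sketch of that idea under stated assumptions; it is not the authors' implementation. `DummyConditionalPolicy`, `DummyCritic`, and the `act` helper are hypothetical placeholders standing in for a trained conditional diffusion policy and an IQL-style Q-function.

```python
# Hypothetical sketch of evaluation-time filtering over policy samples (not the paper's code).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NUM_SAMPLES = 17, 6, 32

class DummyConditionalPolicy(nn.Module):
    """Stand-in for an advantage-conditioned diffusion policy.

    A real implementation would run a reverse diffusion process conditioned on
    (state, target advantage); here a small MLP plus noise mimics a stochastic sampler."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + 1, 256), nn.ReLU(), nn.Linear(256, ACTION_DIM)
        )

    def sample(self, state, target_advantage, num_samples):
        # Tile the conditioning and perturb it so repeated samples differ.
        cond = torch.cat([state, target_advantage], dim=-1)
        cond = cond.expand(num_samples, -1) + 0.1 * torch.randn(num_samples, cond.shape[-1])
        return torch.tanh(self.net(cond))

class DummyCritic(nn.Module):
    """Stand-in Q-function (e.g., one trained with implicit Q-learning backups)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

@torch.no_grad()
def act(policy, critic, state, target_advantage=1.0, num_samples=NUM_SAMPLES):
    """Sample candidate actions conditioned on a high advantage, then return the
    candidate with the largest Q value; this filtering occurs only at evaluation."""
    adv = torch.full((1, 1), target_advantage)
    candidates = policy.sample(state, adv, num_samples)            # (K, action_dim)
    q_values = critic(state.expand(num_samples, -1), candidates)   # (K,)
    return candidates[q_values.argmax()]

if __name__ == "__main__":
    policy, critic = DummyConditionalPolicy(), DummyCritic()
    state = torch.randn(1, STATE_DIM)
    print(act(policy, critic, state))
```

Because the argmax ranges only over actions the policy itself generates, this step avoids querying the critic on arbitrary out-of-distribution actions, which is the failure mode the abstract attributes to explicit maximization during training.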
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4104