BDQL: Offline RL via Behavior Diffusion Q-learning without Policy Constraint

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Offline Reinforcement Learning, Diffusion Policy
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose an on-policy-style algorithm that has the potential to solve offline RL without policy constraints.
Abstract: Offline reinforcement learning (RL) algorithms typically constrain the policy or regularize the value function within an off-policy actor-critic framework to overcome overestimation of out-of-distribution (OOD) actions, and existing on-policy-style offline algorithms cannot escape these constraints (or regularizations) either. In this paper, we propose an on-policy-style algorithm, Behavior Diffusion Q-Learning (BDQL), which has the potential to solve offline RL without introducing any policy constraint. BDQL first recovers the behavior policy with a diffusion model and then updates this diffusion-based behavior policy using a behavior Q-function learned by SARSA. The training of BDQL exhibits a distinctive two-stage pattern. At the beginning of training, thanks to the precise modeling of the diffusion model, the on-policy guidance of the behavior Q-function over the behavior policy is effective enough to solve the offline RL problem. As training progresses, BDQL suffers from the OOD issue, causing training fluctuations or even collapse. The OOD issue therefore arises only after BDQL has already solved the offline problem, which suggests that a policy constraint is not necessary for BDQL to solve offline RL. Although a policy constraint can overcome the OOD issue and thus eliminate the training fluctuations, it also harms the offline solution found in the first stage. We therefore introduce stochastic weight averaging (SWA) to mitigate the training fluctuations without affecting the offline solution. Experiments on D4RL demonstrate this two-stage training phenomenon and show that the first stage is indeed capable of solving offline RL.
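The abstract names three ingredients: a diffusion model fit to the behavior policy, a behavior Q-function learned with SARSA, and SWA to smooth the later, fluctuation-prone stage. Below is a minimal PyTorch sketch of how these pieces could fit together; the network sizes, noise schedule, number of diffusion steps, and helper names (DiffusionPolicy, sarsa_q_step, diffusion_bc_step, q_guided_step) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim.swa_utils import AveragedModel

STATE_DIM, ACTION_DIM, T_DIFF = 17, 6, 5          # placeholder sizes / diffusion steps
betas = torch.linspace(1e-4, 2e-2, T_DIFF)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class DiffusionPolicy(nn.Module):
    """Noise-prediction network conditioned on state and diffusion step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM + 1, 256), nn.Mish(),
            nn.Linear(256, 256), nn.Mish(),
            nn.Linear(256, ACTION_DIM))

    def forward(self, s, a_noisy, t):
        t_emb = t.float().unsqueeze(-1) / T_DIFF
        return self.net(torch.cat([s, a_noisy, t_emb], dim=-1))

    def sample(self, s):
        """Reverse (DDPM-style) diffusion; kept differentiable for Q-guidance."""
        a = torch.randn(s.shape[0], ACTION_DIM)
        for t in reversed(range(T_DIFF)):
            t_vec = torch.full((s.shape[0],), t, dtype=torch.long)
            eps = self(s, a, t_vec)
            a = (a - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps) / (1.0 - betas[t]).sqrt()
            if t > 0:
                a = a + betas[t].sqrt() * torch.randn_like(a)
        return a.clamp(-1.0, 1.0)

policy = DiffusionPolicy()
swa_policy = AveragedModel(policy)                 # SWA copy used for evaluation
q_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.Mish(),
                      nn.Linear(256, 256), nn.Mish(), nn.Linear(256, 1))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def sarsa_q_step(s, a, r, s2, a2, done, gamma=0.99):
    """Behavior Q-function via SARSA: bootstrap on the dataset's own next action,
    so the target never evaluates out-of-distribution actions."""
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(torch.cat([s2, a2], dim=-1))
    loss = F.mse_loss(q_net(torch.cat([s, a], dim=-1)), target)
    q_opt.zero_grad(); loss.backward(); q_opt.step()

def diffusion_bc_step(s, a):
    """Recover the behavior policy with the standard denoising (behavior-cloning) loss."""
    t = torch.randint(0, T_DIFF, (s.shape[0],))
    noise = torch.randn_like(a)
    ab = alphas_bar[t].unsqueeze(-1)
    a_noisy = ab.sqrt() * a + (1.0 - ab).sqrt() * noise
    loss = F.mse_loss(policy(s, a_noisy, t), noise)
    pi_opt.zero_grad(); loss.backward(); pi_opt.step()

def q_guided_step(s):
    """Update the diffusion behavior policy with the behavior Q-function alone:
    no behavior-cloning / policy-constraint term is added, per the abstract."""
    loss = -q_net(torch.cat([s, policy.sample(s)], dim=-1)).mean()
    pi_opt.zero_grad(); loss.backward(); pi_opt.step()
    swa_policy.update_parameters(policy)           # SWA smooths late-stage fluctuations

# Evaluation would act with the averaged policy, e.g. swa_policy.module.sample(state).
```

In this sketch, the two-stage behavior described in the abstract corresponds to running diffusion_bc_step and sarsa_q_step until the behavior policy and Q-function are well fit, then switching to q_guided_step; the SWA averaging is the only mechanism used against the later fluctuations, in place of a policy constraint.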
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7361