Behavior Proximal Policy Optimization

Anonymous

22 Sept 2022, 12:42 (modified: 17 Nov 2022, 03:16) · ICLR 2023 Conference Blind Submission · Readers: Everyone
Keywords: Offline Reinforcement Learning, Monotonic Policy Improvement
TL;DR: We propose Behavior Proximal Policy Optimization (BPPO), which builds on the on-policy method PPO and effectively solves offline RL without introducing any extra constraint or regularization.
Abstract: Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution actions. Various additional augmentations have therefore been proposed to keep the learned policy close to the offline dataset (or behavior policy). In this work, starting from an analysis of offline monotonic policy improvement, we arrive at the surprising finding that some online on-policy algorithms are naturally able to solve offline RL: the inherent conservatism of these on-policy algorithms is exactly what offline RL needs to maintain this closeness. Based on this observation, we design an algorithm called Behavior Proximal Policy Optimization (BPPO), which successfully solves offline RL without introducing any extra constraint or regularization. Extensive experiments on the D4RL benchmark show that this extremely succinct method outperforms state-of-the-art offline RL algorithms.
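Illustrative sketch: to make the idea in the abstract concrete, the snippet below shows a PPO-style clipped surrogate loss evaluated on (state, action) pairs drawn from an offline dataset, with a behavior-cloned policy standing in for PPO's "old" policy. This is a minimal sketch inferred from the summary above only; the function name, tensor shapes, and clip range are illustrative assumptions, not the authors' reference implementation.

import torch

def bppo_policy_loss(log_probs_new, log_probs_behavior, advantages, clip_eps=0.25):
    """PPO-style clipped surrogate loss on offline (state, action) pairs.

    log_probs_new:      log pi_theta(a|s) under the policy being improved.
    log_probs_behavior: log pi_beta(a|s) under the behavior-cloned policy,
                        which plays the role of PPO's "old" policy.
    advantages:         advantage estimates A(s, a) from a critic fit on the dataset.
    """
    ratio = torch.exp(log_probs_new - log_probs_behavior)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the clipped surrogate implicitly keeps pi_theta close to pi_beta,
    # which is the conservatism the abstract attributes to on-policy methods.
    return -torch.min(unclipped, clipped).mean()

# Usage with dummy tensors (shapes are illustrative):
#   lp_new, lp_beta, adv = torch.randn(256), torch.randn(256), torch.randn(256)
#   loss = bppo_policy_loss(lp_new, lp_beta, adv)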
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Reinforcement Learning (e.g., decision and control, planning, hierarchical RL, robotics)