Keywords: Offline reinforcement learning
Abstract: In practice, deploying reinforcement learning (RL) agents in safety-critical scenarios is challenging because of the need for online interaction with the environment. Offline RL offers the promise of directly leveraging large, previously collected datasets to acquire effective policies without further interaction. Although a number of offline RL algorithms have been proposed, their performance is generally limited by the quality of the dataset. To address this problem, we propose an Adaptive Policy constrainT (AdaPT) method, which allows effective exploration of out-of-distribution actions by imposing an adaptive constraint on the learned policy. We theoretically show that AdaPT produces a tight upper bound on the distributional deviation between the learned policy and the behavior policy, and that this upper bound is the minimum requirement to guarantee policy improvement at each iteration. Furthermore, we present a practical AdaPT augmented Actor-Critic (AdaPT-AC) algorithm, which is able to learn a generalizable policy even from a dataset that contains a large amount of random data and induces a poor behavior policy. The empirical results on a range of continuous control benchmark tasks demonstrate that AdaPT-AC substantially outperforms several popular algorithms in terms of both final performance and robustness on four datasets of different qualities.
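To make the constrained actor-critic idea concrete, below is a minimal Python sketch of a generic policy-constrained actor update with an adaptively weighted divergence penalty toward a fitted behavior policy. This is an illustration under stated assumptions, not the authors' AdaPT rule: the Gaussian policy class, the KL penalty, the `adapt_alpha` schedule, and all hyperparameters are assumptions introduced here for exposition.

```python
# Illustrative sketch (not the paper's algorithm): the actor maximizes the
# critic's Q-value while a KL penalty toward an estimated behavior policy is
# weighted by an adaptive coefficient alpha. The adaptation rule below is a
# hypothetical dual-ascent style update, not the AdaPT bound from the paper.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy used for both the learned and behavior policy."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def dist(self, obs):
        h = self.net(obs)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mu(h), std)


def actor_loss(policy, behavior_policy, critic, obs, alpha):
    """One constrained actor step: maximize Q(s, a) while keeping the learned
    policy close to the (estimated) behavior policy, weighted by alpha.
    `critic` is assumed to be a callable returning a (batch,) tensor of Q-values."""
    pi = policy.dist(obs)
    action = pi.rsample()                      # reparameterized sample
    q_value = critic(obs, action)              # critic evaluates the sampled action
    with torch.no_grad():
        beta = behavior_policy.dist(obs)       # behavior policy fitted offline, e.g. by BC
    kl = torch.distributions.kl_divergence(pi, beta).sum(-1)
    return (-q_value + alpha * kl).mean(), kl.mean().item()


def adapt_alpha(alpha, measured_kl, target_kl, lr=1e-3):
    """Hypothetical adaptation: increase the constraint weight when the measured
    divergence exceeds a target, and relax it otherwise (illustrative only)."""
    return max(1e-6, alpha + lr * (measured_kl - target_kl))
```

In this sketch, a larger measured divergence tightens the constraint on the next update, while a small divergence relaxes it, which is one simple way an "adaptive" constraint can trade off exploitation of the critic against staying near the data-collecting policy.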
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=Pp6rHWQ9aX