Keywords: Reinforcement Learning, Offline Reinforcement Learning, Value Regularisation
Abstract: In this paper, we propose a new framework for value regularisation in offline reinforcement learning (RL). While most previous methods avoid explicit out-of-distribution (OOD) region identification due to its difficulty, our method explicitly identifies the OOD region, which can be non-convex depending on the dataset, via a newly proposed trajectory clustering-based behaviour cloning algorithm. Given the explicit OOD region, we then define a Bellman-type operator that pushes the value in the OOD region to a tight lower bound while operating normally in the in-distribution region. The value function learned with this operator can be used for policy acquisition in various ways. Empirical results on multiple offline RL benchmarks show that our method achieves state-of-the-art performance.
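As a minimal sketch of the kind of operator the abstract describes (the exact definition is given in the paper; the symbols $\mathcal{T}_{\mathrm{reg}}$, $\Omega_{\mathrm{OOD}}$, and $\underline{V}$ below are illustrative placeholders, not notation taken from the submission), one could write
\[
(\mathcal{T}_{\mathrm{reg}} Q)(s,a) \;=\;
\begin{cases}
r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\!\left[\max_{a'} Q(s',a')\right], & (s,a) \notin \Omega_{\mathrm{OOD}},\\[4pt]
\underline{V}, & (s,a) \in \Omega_{\mathrm{OOD}},
\end{cases}
\]
where $\Omega_{\mathrm{OOD}}$ denotes the explicitly identified OOD region and $\underline{V}$ is a tight lower bound on the value. In the in-distribution region the operator behaves like the standard Bellman backup, while OOD state-action pairs are pinned to the lower bound rather than conservatively penalised everywhere.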
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 18463