Efficient Offline Reinforcement Learning: The Critic is Critical

23 Sept 2023 (modified: 11 Feb 2024) | Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Offline reinforcement learning, reinforcement learning, efficiency, temporal-difference learning, value error, Bellman error
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Supervised pre-training of offline reinforcement learning algorithms (crucially, of the critic) leads to more efficient subsequent temporal-difference learning.
Abstract: Recent work has demonstrated both benefits and limitations of using supervised approaches (without temporal-difference learning) for offline reinforcement learning. While off-policy reinforcement learning provides a promising approach for improving performance beyond supervised approaches, we observe that training is often inefficient and unstable due to temporal-difference bootstrapping. In this paper, we propose a best-of-both approach by first learning the behaviour policy and critic with supervised learning, before improving with off-policy reinforcement learning. Crucially, we demonstrate that the critic can be learned by pre-training with a supervised Monte-Carlo value-error, making use of commonly neglected downstream information from the provided offline trajectories. This provides consistent initial values for efficient improvement with temporal-difference learning. We further generalise our approach to entropy-regularised reinforcement learning and apply our proposed pre-training to state-of-the-art hard and soft off-policy algorithms. We find that we are able to more than halve the training time of the considered offline algorithms on standard benchmarks, and surprisingly also achieve greater stability. We further build on our insight into the importance of having consistent policy and value functions to propose novel hybrid algorithms that regularise both the actor and the critic towards the behaviour policy. This maintains the benefits of pre-training when learning from limited human demonstrations.
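
The abstract describes a two-phase recipe: first pre-train both the actor (behaviour cloning) and the critic (regression onto Monte-Carlo returns-to-go computed from the offline trajectories), then switch to off-policy temporal-difference updates. The sketch below is a minimal PyTorch-style illustration of that recipe, not the authors' implementation; the network interfaces (`actor(s)`, `critic(s, a)`), the batch field names, and the use of a deterministic actor for the TD target are illustrative assumptions.

```python
# Minimal sketch of the two-phase recipe described in the abstract.
# Assumptions (not from the paper's code): torch modules `actor` and `critic`,
# batches with fields "state", "action", "reward", "next_state", "done",
# and a precomputed per-step discounted "return_to_go".
import torch
import torch.nn.functional as F

GAMMA = 0.99

def monte_carlo_returns(rewards, gamma=GAMMA):
    """Discounted return-to-go for every step of one trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# --- Phase 1: supervised pre-training (no bootstrapping) ---
def pretrain_step(actor, critic, actor_opt, critic_opt, batch):
    s, a, mc_return = batch["state"], batch["action"], batch["return_to_go"]
    # Behaviour cloning: regress onto the dataset action.
    bc_loss = F.mse_loss(actor(s), a)
    actor_opt.zero_grad(); bc_loss.backward(); actor_opt.step()
    # Monte-Carlo value error: regress onto the observed return-to-go.
    mc_loss = F.mse_loss(critic(s, a).squeeze(-1), mc_return)
    critic_opt.zero_grad(); mc_loss.backward(); critic_opt.step()

# --- Phase 2: off-policy improvement with temporal-difference learning ---
def td_step(actor, critic, target_critic, critic_opt, batch, gamma=GAMMA):
    s, a, r, s2, done = (batch[k] for k in
                         ("state", "action", "reward", "next_state", "done"))
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_critic(s2, actor(s2)).squeeze(-1)
    td_loss = F.mse_loss(critic(s, a).squeeze(-1), target)
    critic_opt.zero_grad(); td_loss.backward(); critic_opt.step()
```

The key point of phase 1 is that it never bootstraps from the critic's own estimates, so the critic enters phase 2 with value estimates that are already consistent with the behaviour policy, which is what the abstract credits for the faster and more stable temporal-difference fine-tuning.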
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8212