Offline RL for Online RL: Decoupled Policy Learning for Mitigating Exploration Bias

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: reinforcement learning, offline reinforcement learning, exploration, fine-tuning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: It is desirable for policies to optimistically explore new states and behaviors during online reinforcement learning (RL) or fine-tuning, especially when any prior offline data does not provide enough state coverage. However, exploration bonuses can bias the learned policy, and our experiments find that naïve yet standard use of such bonuses can fail to recover a performant policy. Concurrently, pessimistic training in offline RL has enabled recovery of performant policies from static datasets. Can we leverage offline RL to recover better policies from online interaction? We make a simple observation that a policy can be trained from scratch on all interaction data with pessimistic objectives, thereby decoupling the policies used for data collection and for evaluation. Specifically, we propose the Offline-to-Online-to-Offline (OOO) framework for reinforcement learning (RL), where an optimistic (_exploration_) policy is used to interact with the environment, and a _separate_ pessimistic (_exploitation_) policy is trained on all the observed data for evaluation. Such decoupling can reduce any bias from online interaction (intrinsic rewards, primacy bias) in the evaluation policy, and can allow more exploratory behaviors during online interaction, which in turn can generate better data for exploitation. OOO is complementary to several offline-to-online RL and online RL methods: it improves their average performance by 14% to 26% in our fine-tuning experiments, achieves state-of-the-art performance on several environments in the D4RL benchmarks, and also improves online RL performance by 165% on two OpenAI Gym environments. Further, OOO can enable fine-tuning from incomplete offline datasets where prior methods can fail to recover a performant policy.
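
The abstract describes the OOO framework only at a high level; the following minimal Python sketch illustrates what such a decoupled exploration/exploitation loop could look like. All names here (`exploration_agent`, `exploitation_agent`, their `pretrain`/`act`/`update`/`train_offline` methods, and the classic four-value Gym `step` signature) are hypothetical placeholders for illustration, not the authors' implementation or API.

```python
def ooo_training(env, offline_dataset, num_online_steps,
                 exploration_agent, exploitation_agent, intrinsic_bonus):
    """Sketch of Offline-to-Online-to-Offline (OOO) training.

    exploration_agent:  optimistic policy, trained with exploration bonuses,
                        used only to collect data during online interaction.
    exploitation_agent: pessimistic (offline RL) policy, trained from scratch
                        on all observed data, used only for evaluation.
    """
    # Seed the buffer with any prior offline data (may be empty or incomplete).
    replay_buffer = list(offline_dataset)

    # Phase 1 (offline -> online): optionally pretrain the exploration policy
    # on the offline data before interacting with the environment.
    exploration_agent.pretrain(offline_dataset)

    # Phase 2 (online): collect data with the optimistic exploration policy.
    obs = env.reset()
    for _ in range(num_online_steps):
        action = exploration_agent.act(obs)
        next_obs, reward, done, _ = env.step(action)  # old-style Gym API assumed
        replay_buffer.append((obs, action, reward, next_obs, done))
        # The exploration policy is updated on task reward plus an intrinsic
        # bonus, so it may be biased -- that is acceptable, since it is never
        # the policy that gets evaluated.
        exploration_agent.update(replay_buffer, bonus=intrinsic_bonus)
        obs = env.reset() if done else next_obs

    # Phase 3 (online -> offline): train a separate pessimistic policy from
    # scratch on everything observed so far, with no exploration bonuses.
    exploitation_agent.train_offline(replay_buffer)
    return exploitation_agent  # the decoupled policy used for evaluation
```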
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7042