- Keywords: off-policy RL, offline learning
- Abstract: Recent studies show promising results from applying online RL methods in the offline setting. However, such a learning paradigm may suffer from an overtraining issue: when the dataset is not sufficiently large and diverse, the performance of the policy degrades significantly as training continues. In this work, we propose an alternative approach to alleviate and avoid the overtraining issue: we explicitly take learning stability into account in the policy learning objective, and adaptively select a good policy before overtraining occurs. To do so, we develop an Uncertainty Regularized Policy Learning (URPL) method. URPL adds an uncertainty regularization term to the policy learning objective to encourage learning a more stable policy under the offline setting. Moreover, we use the uncertainty regularization term as a surrogate metric indicating the potential performance of a policy. Based on the low-valued region of the uncertainty term, we can select a policy with considerably good performance at a low computational cost. On the standard offline RL benchmark D4RL, URPL achieves much better final performance than existing state-of-the-art baselines.
- One-sentence Summary: A simple method for addressing the overtraining issue that arises when applying off-policy RL methods to offline learning
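
The uncertainty-regularized policy objective described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the uncertainty term is the disagreement (standard deviation) of an ensemble of Q-value estimates for the policy's actions, and the names `urpl_policy_loss`, `q_ensemble`, and `beta` are hypothetical.

```python
from statistics import mean, pstdev

def urpl_policy_loss(q_ensemble, beta=1.0):
    """Sketch of an uncertainty-regularized policy objective.

    q_ensemble: a list of rows, one per critic in an ensemble, where each
    row holds Q(s, pi(s)) estimates for a batch of policy actions.

    The policy maximizes the mean ensemble Q-value while penalizing the
    ensemble's standard deviation (the uncertainty regularizer). The mean
    uncertainty is also returned, since the abstract notes it can double
    as a surrogate metric for selecting a policy before overtraining.
    """
    per_sample = list(zip(*q_ensemble))            # group estimates per sample
    q_means = [mean(col) for col in per_sample]    # mean value estimate
    uncerts = [pstdev(col) for col in per_sample]  # ensemble disagreement
    loss = -mean(q - beta * u for q, u in zip(q_means, uncerts))
    return loss, mean(uncerts)

# Two critics that agree closely -> low uncertainty, loss dominated by -Q.
loss, unc = urpl_policy_loss([[1.0, 2.0], [1.1, 2.1]], beta=0.5)
```

A policy trained against this loss is pushed toward actions all critics agree on, which is one plausible way to realize the stability-aware objective the abstract describes.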