POLTER: Policy Trajectory Ensemble Regularization for Unsupervised Reinforcement Learning
Abstract: The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that sample efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be obtained in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method to regularize the pretraining that can be applied to any URL algorithm and is especially useful for data- and knowledge-based URL algorithms. It utilizes an ensemble of policies that are discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art for model-free methods on the URLB.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We applied the following changes to the manuscript:
- Extended the related work in Section 2.2 to include more recent and relevant papers.
- Substantially more detailed theoretical derivation of the method in Sections 3.1 and 3.2.
- Additional gradient variance experiment in Section 5 (Q3) to analyze the specific effect of our method during pretraining of ProtoRL and RND.
- Hyperparameter sensitivity analysis of the checkpoint schedule in Section 5 (Q5).
- Evaluation on pixel-based URLB in Section 5 (Q6).
- Note on the evaluation of ProtoRL+POLTER with non-uniform ensemble mixture components in Section 6.
- Experiment on the distance between the pretraining policy of ProtoRL in the PointMass domain in Appendix A.
Supplementary Material: zip
Assigned Action Editor: ~Lihong_Li1
Submission Number: 728