Ensemble Policy Optimization with Diversity Regularization

TMLR Paper290 Authors

20 Jul 2022 (modified: 28 Feb 2023) · Rejected by TMLR
Abstract: In machine learning tasks, ensemble methods have been widely adopted to boost performance by aggregating multiple learning models. However, ensemble methods are much less explored in reinforcement learning, where most previous works only combine multiple value estimators or dynamics models and use a mixed policy to explore the environment. In this work, we propose a simple yet effective ensemble policy optimization method to improve the joint performance of a policy ensemble. The method trains a policy ensemble in which heterogeneous policies explore the environment collectively, and their diversity is maintained by the proposed diversity regularization mechanism. We evaluate the proposed method on continuous control tasks and find that aggregating the learned policies into an ensemble policy at test time greatly improves performance. DEPO achieves better performance and faster convergence than the base on-policy single-agent method it is built upon. Code will be made publicly available.
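The abstract describes DEPO only at a high level. As an illustration of the general idea, and not the paper's actual algorithm, the minimal sketch below assumes the diversity regularizer is a mean pairwise-KL term between the ensemble members' Gaussian action distributions on a shared batch of states, and that the test-time ensemble policy averages the members' action means. All names (GaussianPolicy, pairwise_kl_diversity, ensemble_act), the toy surrogate loss, and the 0.01 coefficient are hypothetical choices for this sketch.

```python
# Hypothetical sketch of an ensemble policy with a diversity regularizer and
# test-time aggregation; the exact formulation used by DEPO may differ.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class GaussianPolicy(nn.Module):
    """A small diagonal-Gaussian policy for continuous control."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> Normal:
        return Normal(self.body(obs), self.log_std.exp())


def pairwise_kl_diversity(policies, obs):
    """Mean pairwise KL between members' action distributions on a shared batch.
    Larger values mean the members behave more differently (more diverse)."""
    dists = [pi.dist(obs) for pi in policies]
    kls = [kl_divergence(dists[i], dists[j]).sum(-1).mean()
           for i in range(len(dists)) for j in range(len(dists)) if i != j]
    return torch.stack(kls).mean()


def ensemble_act(policies, obs):
    """One possible test-time aggregation: average the members' action means."""
    with torch.no_grad():
        return torch.stack([pi.dist(obs).mean for pi in policies]).mean(0)


if __name__ == "__main__":
    obs_dim, act_dim, num_members = 8, 2, 3
    policies = [GaussianPolicy(obs_dim, act_dim) for _ in range(num_members)]
    obs = torch.randn(32, obs_dim)

    # Toy stand-in for an on-policy surrogate objective per member; a real
    # objective would use advantages and importance ratios (e.g. PPO-style).
    actions = [pi.dist(obs).sample() for pi in policies]
    pg_loss = -torch.stack([pi.dist(obs).log_prob(a).sum(-1).mean()
                            for pi, a in zip(policies, actions)]).mean()

    # Subtracting the diversity term rewards the ensemble for staying spread
    # out; the 0.01 coefficient is an arbitrary illustrative value.
    loss = pg_loss - 0.01 * pairwise_kl_diversity(policies, obs)
    loss.backward()
    print("toy loss:", loss.item(),
          "ensemble action shape:", tuple(ensemble_act(policies, obs).shape))
```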
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=lWeSoudtUA
Changes Since Last Submission: With many thanks to the action editor Martha White and the reviewers of the previous submission, we revised the paper in the following aspects:
* We removed the off-policy version of our method, since it created confusion and lacked convincing support for its claims. The previous version presented a premature off-policy variant of DEPO that was neither thoroughly discussed nor fairly compared; for completeness and correctness, we removed the relevant sections.
* A new section, "Problem Formulation" (Sec. 3.1), was added to address the action editor's concern about the clarity of the problem setting.
* To avoid over-claiming, we revised the statements in the introduction and conclusion to restrict the claimed improvement of our method to the on-policy setting, rather than arbitrary RL settings.
* As suggested by the AE, we removed the term "exploration" from the title and the related discussion, since our primary contribution is a policy ensemble training algorithm rather than a method that explicitly encourages exploration.
Please find previous reviews at: https://openreview.net/forum?id=lWeSoudtUA
Assigned Action Editor: ~Martha_White1
Submission Number: 290