Ensemble Policy Optimization with Diversity-regularized Exploration

TMLR Paper45 Authors

11 Apr 2022 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: In machine learning, ensemble methods are widely used to boost performance by aggregating multiple learned models. However, ensembles are much less explored in reinforcement learning, where most previous works only combine multiple value estimators or dynamics models and use a mixed policy to explore the environment. In this work, we propose a simple yet effective ensemble policy optimization method, DEPO, that improves the joint performance of a policy ensemble. The method trains a set of heterogeneous policies that explore the environment collectively, and it maintains their diversity through the proposed diversity regularization mechanism. We evaluate the method on continuous control tasks and find that aggregating the learned policies into an ensemble policy at test time greatly improves performance. DEPO also achieves higher performance and faster convergence than the base on-policy single-agent method it builds upon. Code will be made publicly available.
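The abstract does not spell out the diversity regularizer or the test-time aggregation rule, so the following is only a minimal sketch of one plausible instantiation: Gaussian policies, a pairwise-KL diversity bonus added to a vanilla policy-gradient loss, and test-time aggregation by averaging the members' mean actions. All names here (GaussianPolicy, diversity_bonus, ensemble_pg_loss, ensemble_act, beta) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of diversity-regularized ensemble policy optimization.
# Assumptions (not from the paper): Gaussian policies, pairwise-KL diversity
# bonus, mean-action aggregation at test time.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return Normal(self.net(obs), self.log_std.exp())


def diversity_bonus(policies, obs):
    """Mean pairwise KL between the members' action distributions."""
    dists = [p.dist(obs) for p in policies]
    total, pairs = 0.0, 0
    for i in range(len(dists)):
        for j in range(len(dists)):
            if i != j:
                total = total + kl_divergence(dists[i], dists[j]).sum(-1).mean()
                pairs += 1
    return total / max(pairs, 1)


def ensemble_pg_loss(policies, obs, acts, advantages, beta=0.01):
    """Policy-gradient loss summed over members, minus a diversity bonus
    that discourages the ensemble from collapsing onto one behavior."""
    pg = sum(-(p.dist(obs).log_prob(acts).sum(-1) * advantages).mean()
             for p in policies)
    return pg - beta * diversity_bonus(policies, obs)


@torch.no_grad()
def ensemble_act(policies, obs):
    """Test-time aggregation: average the members' mean actions."""
    return torch.stack([p.dist(obs).mean for p in policies]).mean(0)
```

Under these assumptions, each member would collect its own trajectories during training, and the coefficient beta trades off reward maximization against behavioral diversity across the ensemble.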
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Martha_White1
Submission Number: 45