Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning

Published: 16 Jan 2024, Last Modified: 22 Mar 2024, ICLR 2024 Spotlight
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement Learning, Quality Diversity, Robotics, Machine Learning, Evolution Strategies
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a novel QD-RL method that leverages on-policy RL and Differentiable Quality Diversity to discover a variety of high-performing locomotion gaits on challenging MuJoCo environments, including, for the first time in QD-RL, Humanoid.
Abstract: Training generally capable agents that thoroughly explore their environment and learn new and diverse skills is a long-term goal of robot learning. Quality Diversity Reinforcement Learning (QD-RL) is an emerging research area that blends the best aspects of both fields: Quality Diversity (QD) provides a principled form of exploration and produces collections of behaviorally diverse agents, while Reinforcement Learning (RL) provides a powerful performance improvement operator enabling generalization across tasks and dynamic environments. Existing QD-RL approaches have been constrained to sample-efficient, deterministic off-policy RL algorithms and/or evolution strategies, and struggle with highly stochastic environments. In this work, we, for the first time, adapt on-policy RL, specifically Proximal Policy Optimization (PPO), to the Differentiable Quality Diversity (DQD) framework and propose several changes that enable efficient optimization and discovery of novel skills on high-dimensional, stochastic robotics tasks. Our new algorithm, Proximal Policy Gradient Arborescence (PPGA), achieves state-of-the-art results, including a 4x improvement in best reward over baselines on the challenging humanoid domain.
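To make the abstract's description concrete, the DQD branching loop that PPGA builds on can be summarized in a few lines of code. The sketch below is illustrative only, not the paper's implementation: a toy quadratic objective stands in for episodic reward, analytic gradients stand in for the PPO-based gradient estimates the abstract describes, a crude grid dictionary stands in for a proper archive, and plain Gaussian coefficient sampling replaces the adaptive CMA-MAEGA coefficient search. All function and variable names here are hypothetical.

```python
# Minimal sketch of a DQD (CMA-MEGA-style) loop: branch candidate solutions
# from a search point along objective- and measure-gradient directions, insert
# them into an archive of diverse elites, then move the search point.
import numpy as np

DIM, N_MEASURES, BRANCH, ITERS = 16, 2, 32, 100

def objective_and_gradient(theta):
    # Toy quadratic stand-in; in PPGA the reward gradient is estimated via PPO.
    return -np.sum(theta ** 2), -2.0 * theta

def measures_and_gradients(theta):
    # Toy measures (first coordinates); PPGA also estimates these with PPO.
    m = theta[:N_MEASURES].copy()
    grads = np.zeros((N_MEASURES, DIM))
    grads[np.arange(N_MEASURES), np.arange(N_MEASURES)] = 1.0
    return m, grads

archive = {}  # measure-space cell -> (fitness, solution); crude grid archive
def insert(theta, f, m):
    cell = tuple(np.floor(m * 4).astype(int))
    if cell not in archive or f > archive[cell][0]:
        archive[cell] = (f, theta)

theta = np.random.randn(DIM) * 0.1
for _ in range(ITERS):
    _, grad_f = objective_and_gradient(theta)
    _, grad_m = measures_and_gradients(theta)
    G = np.vstack([grad_f, grad_m])  # (1 + k) gradient directions
    best_f, best_theta = -np.inf, theta
    for _ in range(BRANCH):
        # Plain Gaussian coefficients; PPGA adapts these with CMA-MAEGA.
        c = np.random.randn(1 + N_MEASURES)
        cand = theta + c @ G
        fc, _ = objective_and_gradient(cand)
        mc, _ = measures_and_gradients(cand)
        insert(cand, fc, mc)
        if fc > best_f:
            best_f, best_theta = fc, cand
    theta = best_theta  # walk the search point toward improvement

print(f"archive size: {len(archive)}, "
      f"best f: {max(v[0] for v in archive.values()):.3f}")
```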
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: reinforcement learning
Submission Number: 8192