Reinforcement Learning of Diverse Skills using Mixture of Deep Experts

Onur Celik; Gerhard Neumann

22 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.

Primary Area: reinforcement learning

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: Diverse Skill Learning, black box reinforcement learning, versatile skill learning, mixture of experts policy

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: We propose a method that enables agents to acquire a diverse and versatile set of skills by leveraging a mixture-of-experts policy and deep black-box reinforcement learning.

Abstract: Agents that can acquire diverse skills to solve the same task have an advantage over agents that cannot. Unexpected environmental changes, for example, may prohibit executing a learned behavior, so that complete retraining is necessary unless the agent can discard the invalid skill and rely on previously acquired, different ones. However, Reinforcement Learning (RL) policies mainly rely on Gaussian parameterization, which prevents them from learning multi-modal, diverse skills. In this work, we propose a novel RL approach for training policies that exhibit diverse behavior. To this end, we propose a highly non-linear Mixture of Experts (MoE) as the policy representation, where each expert formalizes a skill as a contextual motion primitive. The context defines the task, which can be, for instance, the goal-reaching position of the agent or changing physical parameters such as friction. Given a context, our trained policy first selects an expert from the repertoire of skills and subsequently adapts the parameters of the contextual motion primitive. To incentivize our policy to learn diverse skills, we leverage a maximum entropy objective combined with a per-expert context distribution that we optimize alongside each expert. The per-expert context distribution allows each expert to focus on a context sub-space and boost learning speed. However, these distributions need to be able to represent multi-modality and hard discontinuities in the environment's context probability space. Moreover, the distributions should not rely on environmental pre-knowledge such as context boundaries, as these are usually not given. We meet these requirements by leveraging energy-based models to represent the per-expert context distributions and show how we can efficiently train them using the standard policy gradient objective. We show that our approach can learn precise and diverse skills on challenging robot simulation tasks.
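
The two-stage action selection described in the abstract (first pick an expert for the given context, then let that expert produce motion-primitive parameters) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the submission: the names `LinearExpert`, `gating_probs`, and `sample_skill` are hypothetical, the experts are linear Gaussians rather than deep networks, and a plain softmax gate stands in for however the paper actually derives expert selection from its per-expert context distributions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LinearExpert:
    """Hypothetical expert: Gaussian over primitive parameters whose mean
    is linear in the context (an assumption; the paper uses deep experts)."""

    def __init__(self, weights, bias, std):
        self.weights = weights  # one row of context weights per parameter
        self.bias = bias        # one offset per parameter
        self.std = std          # shared standard deviation

    def mean(self, context):
        return [
            sum(w * c for w, c in zip(row, context)) + b
            for row, b in zip(self.weights, self.bias)
        ]

    def sample(self, context, rng):
        return [rng.gauss(mu, self.std) for mu in self.mean(context)]


def gating_probs(gate_weights, context):
    """Assumed softmax gate: one linear score per expert."""
    scores = [sum(w * c for w, c in zip(row, context)) for row in gate_weights]
    return softmax(scores)


def sample_skill(experts, gate_weights, context, rng):
    """Stage 1: choose an expert given the context.
    Stage 2: sample motion-primitive parameters from that expert."""
    probs = gating_probs(gate_weights, context)
    idx = rng.choices(range(len(experts)), weights=probs, k=1)[0]
    return idx, experts[idx].sample(context, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    experts = [
        LinearExpert([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], std=0.1),
        LinearExpert([[-1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], std=0.1),
    ]
    gate_weights = [[2.0, 0.0], [-2.0, 0.0]]
    context = [1.0, 0.5]  # e.g. a 2-D goal position (illustrative)
    print(sample_skill(experts, gate_weights, context, rng))
```

Because the two experts map the same context to different parameter means, repeated sampling for one context can return qualitatively different skills, which is the kind of multi-modality a single Gaussian policy cannot represent.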

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

Supplementary Material: zip

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 5988
