Reinforcement Learning of Diverse Skills using Mixture of Deep Experts

Onur Celik; Aleksandar Taranovic; Gerhard Neumann


Published: 20 Oct 2023, Last Modified: 30 Nov 2023, Venue: IMOL@NeurIPS2023

Keywords: Diverse Skill Learning, Automatic Curriculum Learning, Reinforcement Learning, Mixture of Experts

TL;DR: We propose a method that learns diverse skills for solving the same context-defined task by leveraging mixture-of-experts policies.

Abstract: Agents that can acquire diverse skills to solve the same task have an advantage over other agents if, for example, unexpected environmental changes occur. However, Reinforcement Learning (RL) policies mainly rely on Gaussian parameterization, preventing them from learning multi-modal, diverse skills. In this work, we propose a novel RL approach for training policies that exhibit diverse behavior. To this end, we propose a highly non-linear Mixture of Experts (MoE) as the policy representation, where each expert formalizes a skill as a contextual motion primitive. The context defines the task, which can be, for instance, the agent's goal-reaching position or varying physical parameters such as friction. Given a context, our trained policy first selects an expert from the repertoire of skills and subsequently adapts the parameters of the contextual motion primitive. To incentivize our policy to learn diverse skills, we leverage a maximum entropy objective combined with a per-expert context distribution that we optimize alongside each expert. The per-expert context distribution allows each expert to focus on a context sub-space and boosts learning speed. However, these distributions need to be able to represent multi-modality and hard discontinuities in the environment's context probability space. We meet these requirements by leveraging energy-based models to represent the per-expert context distributions and show how we can efficiently train them using the standard policy gradient objective.
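
The two-stage action selection described in the abstract (first pick an expert for the given context, then produce motion primitive parameters conditioned on that context) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MixtureOfExpertsPolicy`, the linear gating, the linear Gaussian experts, and all dimensions are assumptions chosen for brevity, whereas the paper describes a highly non-linear MoE with energy-based per-expert context distributions.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


class MixtureOfExpertsPolicy:
    """Hypothetical MoE policy: gating over experts, then a Gaussian expert.

    Assumption: linear gating and linear expert means stand in for the
    deep, non-linear components described in the abstract.
    """

    def __init__(self, n_experts, context_dim, param_dim, rng):
        self.rng = rng
        # Gating parameters: one logit per expert, linear in the context.
        self.gate_w = rng.normal(size=(n_experts, context_dim))
        self.gate_b = np.zeros(n_experts)
        # Expert parameters: mean of the motion primitive parameters,
        # linear in the context, with a fixed diagonal standard deviation.
        self.mean_w = rng.normal(size=(n_experts, param_dim, context_dim))
        self.mean_b = np.zeros((n_experts, param_dim))
        self.std = 0.1 * np.ones((n_experts, param_dim))

    def gating_probs(self, context):
        """Probability of selecting each expert given the context."""
        return softmax(self.gate_w @ context + self.gate_b)

    def sample(self, context):
        """Select an expert, then sample motion primitive parameters."""
        probs = self.gating_probs(context)
        expert = self.rng.choice(len(probs), p=probs)
        mean = self.mean_w[expert] @ context + self.mean_b[expert]
        noise = self.rng.normal(size=mean.shape)
        params = mean + self.std[expert] * noise
        return expert, params


rng = np.random.default_rng(0)
policy = MixtureOfExpertsPolicy(n_experts=3, context_dim=2, param_dim=4, rng=rng)
context = np.array([0.5, -1.0])  # e.g. a 2-D goal position (illustrative)
expert, params = policy.sample(context)
```

Because the expert is sampled rather than chosen greedily, repeated calls with the same context can return different experts, which is the sense in which such a policy can represent several distinct skills for one task.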

Submission Number: 18
