Sample-Efficient Reinforcement Learning for Continuous Actions with Continuous BDPI
Abstract: Bootstrapped Dual Policy Iteration (BDPI) is a model-free Reinforcement Learning algorithm that combines several off-policy critics with an actor that is robust to off-policy critics. The critics are trained with a variant of Q-Learning, and the actor imitates the average of their greedy policies using Conservative Policy Iteration. BDPI achieves state-of-the-art sample-efficiency in discrete-action domains, but is inapplicable to continuous-action domains, as both the actor and critic update rules rely on the ability to enumerate the actions. In this paper, we present a novel implementation of the BDPI ideas, off-policy critics and an actor, for continuous actions. Our actor is built around a single discriminator network that is easy to train to imitate greedy policies, and the critics take inspiration from the offline (batch) RL literature to allow off-policy learning. In this early-work (visionary) paper, we show that our Continuous BDPI is several times more sample-efficient than Soft Actor-Critic on BipedalWalker, using naive hyper-parameters. In our experimental section, we explain how future work will allow us to characterize the behavior of Continuous BDPI with respect to its hyper-parameters, which will in turn allow it to be applied to more environments.