An efficient reinforcement learning algorithm for learning deterministic policies in continuous domains

Matthieu Zimmer, Paul Weng

Published: 2019, Last Modified: 21 Jul 2025, DAI 2019, CC BY-SA 4.0

Abstract: In this paper, we present an improvement to an existing reinforcement learning algorithm that learns deterministic policies in continuous domains very efficiently. It builds on two recently proposed techniques. First, it can be seen as a variation of an actor-critic algorithm, Penalized Neural-Fitted Actor Critic (PeNFAC) [24], which showed excellent experimental performance in the Roboschool environments. Second, it incorporates a better estimate of the value function of the current policy, the V-trace target [3], which allows the reuse of off-policy data generated by recent previous policies. We experimentally compare two implementations of V-trace: one based on n-step returns and the other on λ-returns. Finally, we show that the proposed algorithm can outperform several state-of-the-art algorithms (TD3, DDPG, PPO, PeNFAC, NFAC) on three environments of the Roboschool benchmark (Hopper, HalfCheetah, Humanoid).
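
To make the V-trace target mentioned in the abstract more concrete, here is a minimal sketch of the generic truncated-importance-weight recursion from [3], computed backwards over one trajectory. It is not the authors' implementation: the function name `vtrace_targets`, its parameters, and the way the importance ratios are obtained are all assumptions made for illustration. In particular, the ratios between the current policy and the data-generating policy are taken as given, and the `lam` factor is one plausible way to move from an n-step style target (`lam=1`) toward a λ-return style target.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0, lam=1.0):
    """Hypothetical helper: V-trace targets for a single trajectory.

    rewards[t], values[t], ratios[t] are per-step quantities;
    bootstrap_value is the critic's estimate for the state reached
    after the last step. ratios[t] is the (assumed given) importance
    ratio between the current policy and the behaviour policy.
    """
    T = len(rewards)
    # V(x_{t+1}) for every step, using the bootstrap value at the end.
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * T
    acc = 0.0  # carries v_{t+1} - V(x_{t+1}) through the backward pass
    for t in reversed(range(T)):
        rho = min(rho_bar, ratios[t])      # truncated weight on the TD error
        c = lam * min(c_bar, ratios[t])    # truncated "trace" coefficient
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets
```

With all ratios equal to 1 and `lam=1`, the recursion reduces to the ordinary on-policy n-step return bootstrapped from `bootstrap_value`; with all ratios equal to 0, the targets collapse to the current value estimates, so off-policy data that the current policy would never produce leaves the critic unchanged.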