GumbelClip: Off-Policy Actor-Critic Using Experience Replay

Norman Tasfi; Miriam Capretz

GumbelClip: Off-Policy Actor-Critic Using Experience Replay

Norman Tasfi, Miriam Capretz

25 Sept 2019 (modified: 05 May 2023)ICLR 2020 Conference Withdrawn SubmissionReaders: Everyone

Keywords: reinforcement learning, off-policy, actor-critic, experience replay

TL;DR: With a set of modifications, under 10 LOC, to A2C you get an off-policy actor-critic that outperforms A2C and performs similarly to ACER. The modifications are large batchsizes, aggressive clamping, and policy "forcing" with gumbel noise.

Abstract: This paper presents GumbelClip, a set of modifications to the actor-critic algorithm, for off-policy reinforcement learning. GumbelClip uses the concepts of truncated importance sampling along with additive noise to produce a loss function enabling the use of off-policy samples. The modified algorithm achieves an increase in convergence speed and sample efficiency compared to on-policy algorithms and is competitive with existing off-policy policy gradient methods while being significantly simpler to implement. The effectiveness of GumbelClip is demonstrated against existing on-policy and off-policy actor-critic algorithms on a subset of the Atari domain.

Original Pdf: pdf

5 Replies

Loading