GumbelClip: Off-Policy Actor-Critic Using Experience Replay

Norman Tasfi, Miriam Capretz

25 Sep 2019 (modified: 24 Dec 2019), ICLR 2020 Conference Withdrawn Submission, Readers: Everyone
  • Keywords: reinforcement learning, off-policy, actor-critic, experience replay
  • TL;DR: A set of modifications to A2C, totaling under 10 lines of code, yields an off-policy actor-critic that outperforms A2C and performs similarly to ACER. The modifications are large batch sizes, aggressive clamping of the importance weights, and policy "forcing" with Gumbel noise.
  • Abstract: This paper presents GumbelClip, a set of modifications to the actor-critic algorithm, for off-policy reinforcement learning. GumbelClip uses the concepts of truncated importance sampling along with additive noise to produce a loss function enabling the use of off-policy samples. The modified algorithm achieves an increase in convergence speed and sample efficiency compared to on-policy algorithms and is competitive with existing off-policy policy gradient methods while being significantly simpler to implement. The effectiveness of GumbelClip is demonstrated against existing on-policy and off-policy actor-critic algorithms on a subset of the Atari domain.
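
The abstract describes the loss only at a high level, so the following is a minimal sketch of how truncated importance sampling and additive Gumbel noise could be combined in a single policy-gradient term. It is not the authors' implementation. The function names, the choice to add noise to the logits of both the current and the behavior policy, and the clamp bounds `c_min` and `c_max` are all assumptions made for illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    # Keep u strictly inside (0, 1) so both logarithms are defined.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_noisy_ratio(pi_logits, mu_logits, action, rng, c_min=0.5, c_max=2.0):
    """Importance weight pi(a|s) / mu(a|s) from Gumbel-perturbed logits.

    Assumption: noise is added to both policies' logits before the softmax,
    and the ratio is clamped to [c_min, c_max]. The bounds are hypothetical.
    """
    noisy_pi = softmax([x + sample_gumbel(rng) for x in pi_logits])
    noisy_mu = softmax([x + sample_gumbel(rng) for x in mu_logits])
    ratio = noisy_pi[action] / noisy_mu[action]
    return min(max(ratio, c_min), c_max)


def policy_loss_term(pi_logits, mu_logits, action, advantage, rng,
                     c_min=0.5, c_max=2.0):
    """One replay sample's contribution: -rho * advantage * log pi(a|s).

    The weight rho is treated as a constant, as in standard
    importance-weighted actor-critic losses.
    """
    rho = clipped_noisy_ratio(pi_logits, mu_logits, action, rng, c_min, c_max)
    log_prob = math.log(softmax(pi_logits)[action])
    return -rho * advantage * log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy logits for a 3-action problem; mu is the (stale) behavior policy.
    loss = policy_loss_term([1.0, 0.5, -0.2], [0.3, 0.3, 0.3],
                            action=1, advantage=0.8, rng=rng)
    print(loss)
```

In this sketch the clamp bounds the variance that a stale replay sample can contribute, while the noise perturbs the ratio around its noiseless value. A full training loop would average this term over a large replay batch and add the usual value and entropy losses.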