- Keywords: reinforcement learning, off-policy, actor-critic, experience replay
- TL;DR: A set of modifications to A2C, totaling under 10 lines of code, yields an off-policy actor-critic that outperforms A2C and performs similarly to ACER. The modifications are large batch sizes, aggressive clamping, and policy "forcing" with Gumbel noise.
- Abstract: This paper presents GumbelClip, a set of modifications to the actor-critic algorithm for off-policy reinforcement learning. GumbelClip combines truncated importance sampling with additive noise to produce a loss function that enables the use of off-policy samples. The modified algorithm converges faster and is more sample-efficient than on-policy algorithms, and it is competitive with existing off-policy policy gradient methods while being significantly simpler to implement. The effectiveness of GumbelClip is demonstrated against existing on-policy and off-policy actor-critic algorithms on a subset of the Atari domain.
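
The abstract names two ingredients, truncated importance sampling and additive Gumbel noise, without giving the loss itself. The sketch below is one possible reading of how they might combine, not the paper's definition. The function name `gumbel_clip_policy_loss`, the clamp bounds `clip_min` and `clip_max`, the choice to perturb only the current policy's logits, and the use of the unperturbed log-probability are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def gumbel_clip_policy_loss(logits, behavior_probs, actions, advantages,
                            rng, clip_min=0.0, clip_max=1.0):
    """Hypothetical policy loss with a clamped, Gumbel-perturbed importance weight.

    logits:         (batch, n_actions) current policy logits
    behavior_probs: (batch, n_actions) probabilities of the policy that
                    generated the replayed samples
    actions:        (batch,) integer actions taken
    advantages:     (batch,) advantage estimates
    clip_min/max:   placeholder clamp bounds, not taken from the source
    """
    rows = np.arange(len(actions))

    # Assumption: "forcing" means adding Gumbel(0, 1) noise to the logits
    # before normalizing, which sharpens the resulting distribution.
    noise = rng.gumbel(size=logits.shape)
    forced_probs = softmax(logits + noise)

    # Truncated importance weight: ratio of forced to behavior probability,
    # clamped to a fixed range.
    ratio = forced_probs[rows, actions] / behavior_probs[rows, actions]
    rho = np.clip(ratio, clip_min, clip_max)

    # Standard actor-critic term, reweighted by the clamped ratio.
    # rho is treated as a constant weight (no gradient flows through it).
    log_pi = np.log(softmax(logits)[rows, actions])
    loss = -np.mean(rho * advantages * log_pi)
    return loss, rho
```

In an autodiff framework the weight `rho` would typically be wrapped in a stop-gradient so that only the log-probability term is differentiated; whether the paper does this is not stated in the abstract.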