Dynamically Balanced Value Estimates for Actor-Critic Methods

25 Sep 2019 (modified: 24 Dec 2019) | ICLR 2020 Conference Withdrawn Submission | Readers: Everyone
  • Keywords: Reinforcement Learning, Actor-Critic, Continuous Control
  • TL;DR: A method for more accurate critic estimates in reinforcement learning.
  • Abstract: Reinforcement learning in an actor-critic setting relies on accurate value estimates from the critic. However, the combination of function approximation, temporal difference (TD) learning, and off-policy training can cause the value function to overestimate. One remedy is Clipped Double Q-learning (CDQ), which is used in the TD3 algorithm and takes the minimum of two critics in the TD target. We show that CDQ induces an underestimation bias and propose a new algorithm that accounts for it by using a weighted average of the CDQ target and the target from a single critic. The weighting parameter is adjusted during training so that the value estimates match the actual discounted return on the most recent episodes, thereby balancing over- and underestimation. Empirically, we obtain more accurate value estimates and demonstrate state-of-the-art results on several OpenAI gym tasks.
  • Code: https://gofile.io/?c=AQFK3j
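
The weighted target described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `beta`, `balanced_target`, `update_beta`, and `step_size` are hypothetical, the choice of the first critic as the "single critic" is an assumption, and the fixed-step rule for adjusting the weight is only one plausible way to push value estimates toward the observed discounted returns (the abstract does not state the exact rule).

```python
def discounted_returns(rewards, gamma):
    """Discounted return from every step of one finished episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def balanced_target(reward, done, q1_next, q2_next, beta, gamma):
    """TD target blending the CDQ target with a single-critic target.

    beta = 1 recovers Clipped Double Q-learning (min of both critics);
    beta = 0 uses only the first critic (assumed to be the single critic).
    """
    cdq_value = min(q1_next, q2_next)
    single_value = q1_next
    blended = beta * cdq_value + (1.0 - beta) * single_value
    return reward + gamma * (1.0 - done) * blended


def update_beta(beta, value_estimates, returns, step_size=0.05):
    """Assumed update rule: move beta by a fixed step, clipped to [0, 1].

    If the critic overestimates the observed returns on recent episodes,
    shift weight toward the pessimistic CDQ target; if it underestimates,
    shift weight toward the single-critic target.
    """
    gap = sum(v - g for v, g in zip(value_estimates, returns)) / len(returns)
    if gap > 0:
        beta += step_size
    elif gap < 0:
        beta -= step_size
    return min(1.0, max(0.0, beta))
```

In a training loop, `update_beta` would be called after each episode with the critic's estimates for the visited state-action pairs and the matching output of `discounted_returns`, and the resulting `beta` would then be used in `balanced_target` for subsequent critic updates.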
