The Mirage of Action-Dependent Baselines in Reinforcement Learning

George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E. Turner, Zoubin Ghahramani, Sergey Levine

Feb 12, 2018 (modified: Jun 04, 2018) ICLR 2018 Workshop Submission
  • Abstract: Model-free reinforcement learning with flexible function approximators has shown success in goal-directed sequential decision-making problems. Policy gradient methods are a widely used class of stable model-free algorithms, and a state-dependent baseline, or control variate, is typically necessary to reduce the variance of the gradient estimator. Several recent papers extend the baseline to depend on both the state and the action, and suggest that this enables significant variance reduction and improved sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and show numerically that learned state-action-dependent baselines do not in fact reduce variance relative to a state-dependent baseline in the commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the sources of the previously observed empirical gains.
  • Keywords: reinforcement learning, action-dependent baseline, variance reduction, policy gradient
  • TL;DR: We decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in the commonly tested benchmark domains.
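
As background for the abstract's premise, the following is a minimal sketch of why a baseline that does not depend on the action leaves the expected policy gradient unchanged while altering its variance. It is not the paper's experiment or code: the single-state softmax bandit, the logits, the rewards, and the helper names (`softmax`, `estimator_stats`) are all illustrative assumptions, and the expected reward is used as one common baseline choice rather than an optimal one. The statistics are computed exactly by enumerating actions, so no sampling is involved.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and total variance of g(a) = grad_theta log pi(a) * (r(a) - baseline).

    Hypothetical helper for illustration: a one-state problem with a softmax
    policy whose parameters theta are the logits themselves.
    """
    pi = softmax(logits)
    k = len(pi)
    # Score function of a softmax policy: d/d theta_j log pi(a) = 1[a == j] - pi[j]
    samples = [
        [((1.0 if j == a else 0.0) - pi[j]) * (rewards[a] - baseline) for j in range(k)]
        for a in range(k)
    ]
    # Expectation over actions drawn from pi
    mean = [sum(pi[a] * samples[a][j] for a in range(k)) for j in range(k)]
    # Total variance = trace of the covariance matrix of g
    var = sum(
        pi[a] * sum((samples[a][j] - mean[j]) ** 2 for j in range(k))
        for a in range(k)
    )
    return mean, var


# Assumed toy values, chosen only for illustration
logits = [0.2, -0.1, 0.5]
rewards = [1.0, 2.0, 4.0]

pi = softmax(logits)
value = sum(p * r for p, r in zip(pi, rewards))  # expected reward, used as the baseline

mean_no_b, var_no_b = estimator_stats(logits, rewards, 0.0)
mean_b, var_b = estimator_stats(logits, rewards, value)

print("mean without baseline:", mean_no_b)
print("mean with baseline:   ", mean_b)
print("total variance without baseline:", var_no_b)
print("total variance with baseline:   ", var_b)
```

In this toy setting the two means agree up to floating-point error, because the score function has zero expectation under the policy, while the total variance is lower with the baseline. The sketch says nothing about action-dependent baselines, which are the subject of the paper's analysis.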
