Variance Reduction for Policy Gradient Methods with Action-Dependent Baselines


Nov 03, 2017 (modified: Nov 03, 2017) ICLR 2018 Conference Blind Submission
  • Abstract: Policy gradient methods have enjoyed success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high dimensional action spaces. To mitigate this issue, we derive an action-dependent baseline for variance reduction which fully exploits the structural form of the stochastic policy itself, and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis and numerical results. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks as well as on high dimensional manipulation and multi-agent communication tasks.
  • TL;DR: Action-dependent baselines can be bias-free and yield greater variance reduction than baselines that depend only on the state for policy gradient methods.
  • Keywords: reinforcement learning, policy gradient, variance reduction, baseline, control variates
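
The key idea can be illustrated with a minimal sketch (not the paper's implementation, and all function names here are hypothetical): for a policy that factorizes over action dimensions, e.g. a factorized Gaussian, the gradient term for dimension i may use a baseline b_i that depends on the state and on the *other* action dimensions a_{-i} without introducing bias, since E_{a_i}[∇ log π(a_i|s) · b_i(s, a_{-i})] = 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_prob(mu, sigma, a):
    """Per-dimension gradient of log N(a_i | mu_i, sigma^2) w.r.t. mu_i."""
    return (a - mu) / sigma**2

def pg_estimate(mu, sigma, a, q_value, baselines):
    """Policy-gradient term w.r.t. the Gaussian mean, with one baseline
    per action dimension. baselines[i] stands in for b_i(s, a_{-i});
    here it is just a hypothetical per-dimension scalar."""
    g = grad_log_prob(mu, sigma, a)
    return g * (q_value - baselines)  # one "advantage" per action dimension

# Tiny demo of unbiasedness: for any baseline that does not depend on a_i
# itself, the baseline's contribution averages out to zero over samples of
# a_i, while a well-chosen baseline can still shrink the variance.
mu = np.zeros(3)
sigma = 1.0
samples = rng.normal(mu, sigma, size=(200_000, 3))
q = 1.0                             # pretend Q(s, a) is a constant here
b = np.array([0.0, 0.5, 1.0])       # different baselines per dimension

grads = pg_estimate(mu, sigma, samples, q, b)
print(np.abs(grads.mean(axis=0)))   # all entries near 0: no added bias
```

Note that a state-only baseline corresponds to forcing all entries of `b` to the same value; allowing them to differ per dimension is what gives the extra variance reduction the abstract describes.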