Variance Reduction for Policy Gradient Methods with Action-Dependent Baselines
Nov 03, 2017 (modified: Nov 03, 2017) · ICLR 2018 Conference Blind Submission · readers: everyone
Abstract: Policy gradient methods have enjoyed success in deep reinforcement learning but suffer from high variance in their gradient estimates. The high-variance problem is particularly exacerbated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive an action-dependent baseline for variance reduction that fully exploits the structural form of the stochastic policy itself and does not make any additional assumptions about the MDP. We demonstrate and quantify the benefit of the action-dependent baseline through both theoretical analysis and numerical results. Our experimental results indicate that action-dependent baselines allow for faster learning on standard reinforcement learning benchmarks as well as on high-dimensional manipulation and multi-agent communication tasks.
TL;DR: Action-dependent baselines can be bias-free and yield greater variance reduction than state-only baselines for policy gradient methods.
Keywords: reinforcement learning, policy gradient, variance reduction, baseline, control variates
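The claim that an action-dependent baseline can stay bias-free while reducing variance can be illustrated numerically. What follows is a minimal sketch, not the authors' implementation, built on assumptions of our own: a single-state, one-step problem whose Q(s, a) is a known quadratic (the hypothetical `reward` with target `c`), a Gaussian policy that factorizes across action dimensions with a fixed `sigma`, and gradients taken with respect to the mean only. For dimension i the estimator is (Q(s, a) - b_i) * grad log pi_i(a_i | s). The action-dependent baseline b_i(s, a_{-i}) averages Q over a_i alone (computed here in closed form rather than learned), so it never depends on a_i and its expected contribution to the gradient is zero. The names `grad_samples` and its `baseline` options are hypothetical.

```python
import numpy as np


def reward(a, c):
    # Hypothetical toy Q(s, a) for a single state: larger when a is close to target c.
    return -np.sum((a - c) ** 2, axis=-1)


def grad_samples(mu, sigma, c, n, baseline, rng):
    """Per-sample score-function estimates of dE[reward]/dmu, shape (n, d)."""
    d = mu.shape[0]
    # Factorized Gaussian policy (assumption): each a_i ~ N(mu_i, sigma^2) independently.
    a = mu + sigma * rng.standard_normal((n, d))
    # Score of each factor: d log pi_i(a_i) / d mu_i.
    score = (a - mu) / sigma**2
    q = reward(a, c)[:, None]  # shape (n, 1), shared by all dimensions
    if baseline == "none":
        b = np.zeros((n, d))
    elif baseline == "state":
        # State-only baseline: with one state it is a constant, here the exact E[Q].
        b = np.full((n, d), -(np.sum((mu - c) ** 2) + d * sigma**2))
    elif baseline == "action":
        # Action-dependent baseline b_i(a_{-i}) = E_{a_i ~ pi_i}[Q(a)]:
        # the other dimensions keep their sampled values, dimension i is averaged out
        # using E[(a_i - c_i)^2] = (mu_i - c_i)^2 + sigma^2.
        per_dim = (a - c) ** 2
        others = per_dim.sum(axis=1, keepdims=True) - per_dim
        b = -(others + (mu - c) ** 2 + sigma**2)
    else:
        raise ValueError(f"unknown baseline: {baseline}")
    return (q - b) * score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, sigma = 50, 20_000, 0.5
    mu = np.zeros(d)
    c = np.linspace(-1.0, 1.0, d)
    true_grad = -2.0 * (mu - c)  # analytic gradient of E[Q] with respect to mu
    for name in ("none", "state", "action"):
        g = grad_samples(mu, sigma, c, n, name, rng)
        err = np.abs(g.mean(axis=0) - true_grad).max()
        var = g.var(axis=0).mean()
        print(f"{name:>6}: max |mean - true grad| = {err:.3f}, mean variance = {var:.2f}")
```

All three estimators should agree with the analytic gradient up to Monte Carlo error. The state-only baseline removes the mean of Q but still leaves every other dimension's noise in dimension i's estimate. The action-dependent baseline cancels that noise as well, which is the intuition for why its advantage grows with the dimensionality of the action space.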