Policy Optimization with Second-Order Advantage Information
Jiajin Li, Baoxiang Wang
Feb 12, 2018 (modified: Feb 13, 2018) · ICLR 2018 Workshop Submission · Readers: everyone
Abstract: Policy optimization in high-dimensional action spaces is difficult because policy gradient estimators suffer from high variance. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce this variance. To invoke RB, the algorithm learns the underlying factorization structure of the action space from the second-order gradient of the advantage function with respect to the action. Empirical studies demonstrate performance improvements on high-dimensional synthetic settings and on OpenAI Gym's MuJoCo continuous control tasks.
TL;DR: A novel policy gradient estimator incorporating both the Rao-Blackwell theorem and Control Variates into a unified framework.
Keywords: Policy Gradient, Variance Reduction, Control Variates, Rao-Blackwellization
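
To make the abstract's construction concrete, here is a minimal sketch (not the authors' code) of the step it describes: compute the second-order gradient (Hessian) of the advantage function with respect to the action, then read the factorization structure of the action space off the Hessian's sparsity pattern. The quadratic stand-in advantage, the `tol` threshold, and the connected-components grouping are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def advantage(action):
    # Stand-in advantage A(s, a) for a fixed state: a quadratic whose
    # curvature is block-diagonal, so dims {0, 1} and {2} are independent.
    H = jnp.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 3.0]])
    return -0.5 * action @ H @ action

def action_subspaces(adv_fn, action, tol=1e-6):
    # Second-order gradient of the advantage w.r.t. the action.
    hess = jax.hessian(adv_fn)(action)
    # Treat |H_ij| > tol as a dependency edge between action dims i and j;
    # each connected component of this graph is one action subspace.
    n = hess.shape[0]
    adj = jnp.abs(hess) > tol
    seen, groups = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:
            j = stack.pop()
            if j in seen:
                continue
            seen.add(j)
            comp.append(j)
            stack.extend(k for k in range(n) if bool(adj[j, k]) and k not in seen)
        groups.append(sorted(comp))
    return groups

a = jnp.zeros(3)
print(action_subspaces(advantage, a))  # [[0, 1], [2]]
```

With the subspaces in hand, Rao-Blackwellization can be applied per subspace rather than over the full action vector, which is where the variance reduction the abstract claims would come from.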