Track: Research Track
Keywords: Bayesian optimization, policy search, Gaussian processes, temporal difference learning, regret bounds
Abstract: Bayesian optimization (BO) is a method commonly used for policy search in problems with low-dimensional policy parameterizations. While it is generally considered data-efficient, existing BO approaches are agnostic to the sequential structure of the optimization objective induced by policy roll-outs. As a result, valuable information that could improve the convergence of BO is discarded.
We address this inefficiency by developing and rigorously analyzing a novel BO approach that relies on a temporal difference learning formulation for discounted infinite-horizon value functions based on Gaussian process (GP) regression. We derive learning error bounds for the proposed temporal difference GPs, which allow us to exploit upper confidence bounds and analyze the cumulative regret of our BO approach. This analysis is further refined by bounding the maximal information gain for our temporal difference GP model. In a comparison with a BO approach that is agnostic to the sequential structure and a reinforcement learning baseline on classic control benchmarks, we demonstrate the practical advantages of our method.
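To make the setting concrete, the sketch below shows a plain GP-UCB policy-search loop of the kind the abstract contrasts with: a GP surrogate is fit to discounted roll-out returns and the next policy parameter is chosen by an upper-confidence-bound acquisition. It is a minimal illustration only; the toy linear system, kernel, and hyperparameters are assumptions for demonstration, and it does not implement the paper's temporal difference GP model, which additionally exploits the sequential structure of the roll-outs.

```python
# Minimal GP-UCB policy-search sketch (structure-agnostic baseline).
# Toy system, kernel, and hyperparameters are illustrative assumptions,
# not the paper's temporal-difference GP formulation.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.99      # discount factor for the (truncated) infinite-horizon return
HORIZON = 200     # roll-out truncation length


def rollout_return(theta):
    """Discounted return of a linear feedback policy u = -theta * x
    on a scalar toy system x' = 0.9 x + u with reward -x^2."""
    x, ret = 1.0, 0.0
    for t in range(HORIZON):
        u = -theta * x
        ret += (GAMMA ** t) * (-(x ** 2))
        x = 0.9 * x + u
    return ret


def rbf(A, B, ell=0.3, sf=1.0):
    """Squared-exponential kernel between 1-D parameter arrays."""
    d = A[:, None] - B[None, :]
    return sf ** 2 * np.exp(-0.5 * (d / ell) ** 2)


def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean and standard deviation at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))


# GP-UCB Bayesian optimization over the scalar policy gain theta.
candidates = np.linspace(0.0, 1.8, 200)
X = np.array([rng.uniform(0.0, 1.8)])   # initial design point
y = np.array([rollout_return(X[0])])
beta = 2.0                               # UCB exploration weight

for _ in range(15):
    mu, sigma = gp_posterior(X, y, candidates)
    theta_next = candidates[np.argmax(mu + beta * sigma)]  # UCB acquisition
    X = np.append(X, theta_next)
    y = np.append(y, rollout_return(theta_next))

best = X[np.argmax(y)]
print(f"best gain theta = {best:.3f}, estimated return = {y.max():.3f}")
```

In this baseline, each roll-out contributes only a single scalar return to the GP; the approach proposed in the paper instead models the value function with a temporal difference GP, so that the intermediate transitions of each roll-out inform the surrogate as well.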
Submission Number: 73