Track: Research Track
Keywords: Bayesian optimization, policy search, Gaussian processes, temporal difference learning, regret bounds
Abstract: Bayesian optimization (BO) is a method commonly used for policy search in problems with low-dimensional policy parameterizations. While it is generally considered data-efficient, existing BO approaches are agnostic to the sequential structure of the optimization objective induced by policy roll-outs. As a result, valuable information that could improve the convergence of BO is discarded.
We address this inefficiency by developing and rigorously analyzing a novel BO approach that relies on a temporal difference learning formulation for discounted infinite-horizon value functions based on Gaussian process (GP) regression. We derive learning error bounds for the proposed temporal difference GPs, which allow us to exploit upper confidence bounds and analyze the cumulative regret of our BO approach. This analysis is further refined by bounding the maximal information gain for our temporal difference GP model. In a comparison with a BO approach that is agnostic to the sequential structure and a reinforcement learning baseline on classic control benchmarks, we demonstrate the practical advantages of our method.
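To make the setting concrete, the sketch below shows a plain GP-UCB policy-search loop of the kind the abstract contrasts with: a GP surrogate is fit to discounted roll-out returns and the next policy parameter is chosen by an upper-confidence-bound acquisition. It is a minimal illustration only; the toy linear system, kernel, and hyperparameters are assumptions for demonstration, and it does not implement the paper's temporal difference GP model, which additionally exploits the sequential structure of the roll-outs.

```python
# Minimal GP-UCB policy-search sketch (structure-agnostic baseline).
# Toy system, kernel, and hyperparameters are illustrative assumptions,
# not the paper's temporal-difference GP formulation.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.99      # discount factor for the (truncated) infinite-horizon return
HORIZON = 200     # roll-out truncation length


def rollout_return(theta):
    """Discounted return of a linear feedback policy u = -theta * x
    on a scalar toy system x' = 0.9 x + u with reward -x^2."""
    x, ret = 1.0, 0.0
    for t in range(HORIZON):
        u = -theta * x
        ret += (GAMMA ** t) * (-(x ** 2))
        x = 0.9 * x + u
    return ret


def rbf(A, B, ell=0.3, sf=1.0):
    """Squared-exponential kernel between 1-D parameter arrays."""
    d = A[:, None] - B[None, :]
    return sf ** 2 * np.exp(-0.5 * (d / ell) ** 2)


def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean and standard deviation at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))


# GP-UCB Bayesian optimization over the scalar policy gain theta.
candidates = np.linspace(0.0, 1.8, 200)
X = np.array([rng.uniform(0.0, 1.8)])   # initial design point
y = np.array([rollout_return(X[0])])
beta = 2.0                               # UCB exploration weight

for _ in range(15):
    mu, sigma = gp_posterior(X, y, candidates)
    theta_next = candidates[np.argmax(mu + beta * sigma)]  # UCB acquisition
    X = np.append(X, theta_next)
    y = np.append(y, rollout_return(theta_next))

best = X[np.argmax(y)]
print(f"best gain theta = {best:.3f}, estimated return = {y.max():.3f}")
```

In this baseline, each roll-out contributes only a single scalar return to the GP; the approach proposed in the paper instead models the value function with a temporal difference GP, so that the intermediate transitions of each roll-out inform the surrogate as well.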
Submission Number: 73