Keywords: temporal difference learning, multi-step learning, advantage function, control variate, bias-variance tradeoff
Abstract: We analyze the finite-sample variance of temporal difference (TD) learning in the phased TD setting, and show that one of the mechanisms by which bootstrapping reduces variance is that it effectively aggregates information over a larger number of independent trajectories.
Based on this insight, we demonstrate that, asymptotically, the variance of TD learning is bounded from above by that of Monte-Carlo (MC) estimators.
In addition, we draw connections to Direct Advantage Estimation (DAE), a method for estimating the advantage function, and show that it can be seen as a type of regression-adjusted control variate, which further reduces the variance of TD estimates.
Finally, we illustrate the asymptotic behavior of these estimators empirically in carefully designed environments.
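The abstract's connection between DAE and regression-adjusted control variates can be illustrated with a generic variance-reduction sketch. The following is a minimal Python example of a regression-adjusted control variate in a toy Monte-Carlo setting (not the paper's phased TD or advantage-estimation setup): the function f, the control variate g, and the distribution of X are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: estimate E[f(X)] for X ~ N(0, 1), with f(x) = x**2 + x.
# The true value is 1. We use g(x) = x as a control variate with known mean 0.
n = 10_000
x = rng.normal(size=n)
f = x**2 + x
g = x  # control variate, E[g] = 0

# Plain Monte-Carlo estimate.
mc_estimate = f.mean()

# Regression-adjusted control variate: beta = Cov(f, g) / Var(g),
# then subtract beta * (g - E[g]) before averaging.
beta = np.cov(f, g, ddof=0)[0, 1] / np.var(g)
cv_estimate = (f - beta * (g - 0.0)).mean()

print(f"MC estimate:              {mc_estimate:.4f}")
print(f"Control-variate estimate: {cv_estimate:.4f}")
```

Both estimators are unbiased for E[f(X)], but the regression-adjusted one has lower variance whenever f and g are correlated, which mirrors the variance-reduction role the abstract attributes to DAE relative to plain TD.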
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Hsiao-Ru_Pan1
Track: Regular Track: unpublished work
Submission Number: 5