On the Variance of Temporal Difference Learning and its Reduction Using Control Variates

Published: 17 Jul 2025, Last Modified: 06 Sept 2025 | EWRL 2025 Poster | CC BY 4.0
Keywords: temporal difference learning, multi-step learning, advantage function, control variate, bias variance tradeoff
Abstract: We analyze the finite-sample variance of temporal difference (TD) learning in the phased TD setting, and show that one mechanism by which bootstrapping reduces variance is that it effectively aggregates over a larger number of independent trajectories. Based on this insight, we demonstrate that, asymptotically, the variance of TD learning is bounded from above by that of Monte-Carlo (MC) estimators. In addition, we draw connections to Direct Advantage Estimation (DAE), a method for estimating the advantage function, and show that it can be viewed as a type of regression-adjusted control variate, which further reduces the variance of TD. Finally, we illustrate the asymptotic behavior of these estimators empirically in carefully designed environments.
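As a rough illustration of the variance comparison described in the abstract, the sketch below contrasts MC and phased TD(0) estimates of the value of the initial state on a toy deterministic chain with noisy rewards. This is not the paper's code or one of its environments: the chain, discount factor, noise level, and sample counts are all hypothetical choices made for the illustration; "phased TD(0)" here simply recomputes each state's value in every phase from fresh one-step samples that bootstrap on the previous phase's value table.

```python
# Hypothetical sketch (not the authors' implementation): empirical variance of
# Monte-Carlo vs. phased TD(0) estimates of V(s0) on a toy chain MDP.
import numpy as np

rng = np.random.default_rng(0)
N, GAMMA = 5, 0.9            # chain length and discount factor (illustrative)
N_TRAJ, N_TRIALS = 10, 2000  # samples per estimate / independent repetitions

def reward():
    """Noisy reward with mean 1; its noise is the only source of variance."""
    return 1.0 + rng.normal(scale=1.0)

def mc_estimate():
    """Monte-Carlo: average the discounted returns of N_TRAJ full trajectories."""
    returns = []
    for _ in range(N_TRAJ):
        g, disc = 0.0, 1.0
        for _ in range(N - 1):  # deterministic walk from state 0 to the terminal state
            g += disc * reward()
            disc *= GAMMA
        returns.append(g)
    return np.mean(returns)

def phased_td_estimate(n_phases=N - 1):
    """Phased TD(0): each phase draws N_TRAJ fresh one-step transitions per
    state and bootstraps on the previous phase's value table."""
    v = np.zeros(N)  # v[N-1] stays 0: terminal state
    for _ in range(n_phases):
        v_new = np.zeros(N)
        for s in range(N - 1):
            samples = [reward() + GAMMA * v[s + 1] for _ in range(N_TRAJ)]
            v_new[s] = np.mean(samples)
        v = v_new
    return v[0]

mc = np.array([mc_estimate() for _ in range(N_TRIALS)])
td = np.array([phased_td_estimate() for _ in range(N_TRIALS)])
print(f"MC        : mean={mc.mean():.3f}, var={mc.var():.4f}")
print(f"phased TD : mean={td.mean():.3f}, var={td.var():.4f}")
```

With these settings, the phased TD estimate's empirical variance should come out noticeably below the MC estimate's, consistent with the intuition stated above: the bootstrapped estimate for each state pools the noise of many independent one-step transitions rather than accumulating the noise of full returns.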
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Hsiao-Ru_Pan1
Track: Regular Track: unpublished work
Submission Number: 5