Keywords: temporal difference learning, multi-step learning, advantage function, control variate, bias-variance tradeoff
Abstract: We analyze the finite-sample variance of temporal difference (TD) learning in the phased TD setting, and show that one of the mechanisms by which bootstrapping reduces variance is that it effectively aggregates information over a larger number of independent trajectories.
Based on this insight, we demonstrate that, asymptotically, the variance of TD learning is bounded from above by that of Monte-Carlo (MC) estimators.
In addition, we draw connections to Direct Advantage Estimation (DAE), a method for estimating the advantage function, and show that it can be seen as a type of regression-adjusted control variate, which further reduces the variance of TD estimates.
Finally, we illustrate the asymptotic behavior of these estimators empirically in carefully designed environments.
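The abstract's connection between DAE and regression-adjusted control variates can be illustrated with a generic variance-reduction sketch. The following is a minimal Python example of a regression-adjusted control variate in a toy Monte-Carlo setting (not the paper's phased TD or advantage-estimation setup): the function f, the control variate g, and the distribution of X are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: estimate E[f(X)] for X ~ N(0, 1), with f(x) = x**2 + x.
# The true value is 1. We use g(x) = x as a control variate with known mean 0.
n = 10_000
x = rng.normal(size=n)
f = x**2 + x
g = x  # control variate, E[g] = 0

# Plain Monte-Carlo estimate.
mc_estimate = f.mean()

# Regression-adjusted control variate: beta = Cov(f, g) / Var(g),
# then subtract beta * (g - E[g]) before averaging.
beta = np.cov(f, g, ddof=0)[0, 1] / np.var(g)
cv_estimate = (f - beta * (g - 0.0)).mean()

print(f"MC estimate:              {mc_estimate:.4f}")
print(f"Control-variate estimate: {cv_estimate:.4f}")
```

Both estimators are unbiased for E[f(X)], but the regression-adjusted one has lower variance whenever f and g are correlated, which mirrors the variance-reduction role the abstract attributes to DAE relative to plain TD.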
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Hsiao-Ru_Pan1
Track: Regular Track: unpublished work
Submission Number: 5