Keywords: meta-reinforcement learning, meta-gradient, reinforcement learning
TL;DR: This work makes the bias-variance tradeoff in meta-gradients due to truncation and sampling correction explicit and shows that DiCE and related methods always add bias.
Abstract: Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as tackling the problem of credit assignment to pre-adaptation behavior by making a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always add bias and can also add variance to meta-gradient estimation. DiCE-like approaches are therefore unlikely to lie on Pareto frontier of the bias-variance tradeoff and should not be pursued in the context of meta-gradients for RL. Meanwhile, the sampling correction has not been studied in the important long-horizon setting, where the inner optimization trajectories must be truncated for computational tractability. We study the bias and variance tradeoff induced by truncated backpropagation in combination with a weighted sampling correction. While prior work has implicitly chosen points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study which relates existing estimators to each other.