A Gradient Critic for Policy Gradient Estimation

Published: 20 Jul 2023, Last Modified: 29 Aug 2023
Venue: EWRL16
Keywords: temporal-difference, policy-gradient, actor-critic, semi-gradient
Abstract: The policy gradient theorem (Sutton et al., 2000) prescribes the use of the on-policy state distribution to approximate the gradient. In practice, most algorithms based on this theorem break this assumption, introducing a distribution shift that can cause convergence to poor solutions. In this paper, we propose a new approach that reconstructs the policy gradient from the start state without requiring a particular sampling strategy. In this form, the policy gradient calculation can be expressed in terms of a gradient critic, which can be estimated recursively thanks to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic on an off-policy data stream, we develop the first estimator that side-steps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and better performance in the presence of off-policy samples. The extended version of this work can be found in Tosatto et al. (2022), and the implementation of the experiments at github.com/SamuelePolimi/temporal-difference-gradient.
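A minimal sketch of the recursion the abstract alludes to (the notation here is our assumption, not taken verbatim from the paper): differentiating the Bellman equation $V^\pi(s) = \sum_a \pi_\theta(a \mid s)\, Q^\pi(s,a)$, with $Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^\pi(s')$, yields a Bellman equation for gradients,

\[
\nabla_\theta V^\pi(s) \;=\; \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^\pi(s,a) \;+\; \gamma \sum_a \pi_\theta(a \mid s) \sum_{s'} P(s' \mid s,a)\, \nabla_\theta V^\pi(s'),
\]

and the policy gradient is then recovered at the start-state distribution, $\nabla_\theta J(\theta) = \mathbb{E}_{s_0 \sim \mu_0}\!\left[\nabla_\theta V^\pi(s_0)\right]$. Under this reading, a vector-valued gradient critic approximating $\nabla_\theta V^\pi$ can be learned with temporal-difference updates on the recursion above, independently of the distribution that generated the sampled transitions.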
Already Accepted Paper At Another Venue: already accepted somewhere else