Abstract: Though widely useful in reinforcement learning, ``semi-gradient'' methods---including TD($\lambda$) and Q-learning---do not converge as robustly as gradient-based methods. Even in the case of linear function approximation, convergence cannot be guaranteed for these methods when they are used with off-policy training, in which an agent uses a behavior policy that differs from the target policy in order to gain experience for learning. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the best known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD2, a GTD algorithm, by introducing second-order differences in successive parameter updates. We investigate this algorithm in the framework of linear value function approximation, proving its convergence by applying the theory of stochastic approximation. Studying the algorithm empirically on the random walk task, the Boyan chain task, and Baird's off-policy counterexample, we find substantial improvement over GTD2 and, in several cases, better performance even than conventional TD learning.
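The abstract builds on GTD2 with linear value function approximation. As a point of reference, below is a minimal sketch of the standard GTD2 update (two-timescale weights $\theta$ and $w$, importance ratio $\rho$), run on a small on-policy random walk. The environment, step sizes, and tabular features are illustrative assumptions; the Gradient-DD modification itself is not shown, since the abstract does not specify its update rule.

```python
import numpy as np

def gtd2_update(theta, w, phi, phi_next, reward, rho, gamma, alpha, beta):
    """One GTD2 update for linear value-function approximation.

    theta : value-function weights, V(s) ~ theta @ phi(s)
    w     : auxiliary weights tracking the expected TD error in feature space
    rho   : importance-sampling ratio pi(a|s) / b(a|s) (1.0 when on-policy)
    """
    delta = reward + gamma * theta @ phi_next - theta @ phi   # TD error
    theta = theta + alpha * rho * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w

# Illustrative 5-state random walk with tabular (one-hot) features (assumed setup).
rng = np.random.default_rng(0)
n_states, gamma, alpha, beta = 5, 1.0, 0.05, 0.05
features = np.eye(n_states)
theta, w = np.zeros(n_states), np.zeros(n_states)

for episode in range(2000):
    s = n_states // 2                          # start in the middle state
    while True:
        s_next = s + rng.choice([-1, 1])       # behavior = target policy here, so rho = 1
        done = s_next < 0 or s_next >= n_states
        reward = 1.0 if s_next >= n_states else 0.0
        phi = features[s]
        phi_next = np.zeros(n_states) if done else features[s_next]
        theta, w = gtd2_update(theta, w, phi, phi_next, reward,
                               rho=1.0, gamma=gamma, alpha=alpha, beta=beta)
        if done:
            break
        s = s_next

print(theta)  # should approach the true values 1/6, 2/6, ..., 5/6 for this chain
```

With tabular features and on-policy sampling, GTD2 converges to the same fixed point as TD(0); the paper's interest is the off-policy case, where semi-gradient TD can diverge but GTD-family updates like the one above remain convergent, at the cost of slower learning.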
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Gergely_Neu1
Submission Number: 1606