Schedule Based Temporal Difference Algorithms

Published: 01 Jan 2022 · Last Modified: 03 Oct 2024 · Allerton 2022 · CC BY-SA 4.0
Abstract: Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD($\lambda$) is a popular class of algorithms to solve this problem. However, the weights assigned to different $n$-step returns in TD($\lambda$), controlled by the parameter $\lambda$, decrease exponentially with increasing $n$. In this paper, we present a $\lambda$-schedule procedure that generalizes the TD($\lambda$) algorithm to the case where the parameter $\lambda$ can vary with the time step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{\lambda_t\}_{t \geq 1}$. Based on this procedure, we propose an on-policy algorithm, TD($\lambda$)-schedule, and two off-policy algorithms, GTD($\lambda$)-schedule and TDC($\lambda$)-schedule. We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
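For concreteness, one natural weighting consistent with this description (a sketch of how a sequence $\{\lambda_t\}$ can induce weights on $n$-step returns, not necessarily the exact construction used in the paper) assigns the $n$-step return $G_t^{(n)}$ the weight $\big(\prod_{i=1}^{n-1} \lambda_{t+i}\big)(1 - \lambda_{t+n})$, so that

$$G_t^{\{\lambda\}} \;=\; \sum_{n \geq 1} \Big(\prod_{i=1}^{n-1} \lambda_{t+i}\Big)\,(1 - \lambda_{t+n})\, G_t^{(n)}.$$

When the schedule is constant, $\lambda_t \equiv \lambda$, the weight on $G_t^{(n)}$ reduces to $(1-\lambda)\lambda^{n-1}$, recovering the exponentially decaying weights of standard TD($\lambda$).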