Keywords: off-policy learning, positive definite, convergence analysis, temporal difference learning, linear function approximation
TL;DR: We propose a modified Retrace to measure the off-policyness between the target policy and the behavior policy, and obtain a convergence guarantee.
Abstract: Off-policy learning is key to extending reinforcement learning, as it allows a target policy to be learned from data generated by a different behavior policy. However, its combination with bootstrapping and function approximation is well known as ``the deadly triad''. Retrace is an efficient and convergent off-policy algorithm for tabular value functions that employs truncated importance sampling ratios. Unfortunately, Retrace is known to be unstable with linear function approximation. In this paper, we propose a modified Retrace to correct the off-policy return, derive a new off-policy temporal difference learning algorithm (TD-MRetrace) with linear function approximation, and obtain a convergence guarantee under standard assumptions in both the prediction and control cases. Experimental results on counterexamples and control tasks validate the effectiveness of the proposed algorithm compared with traditional algorithms.
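For readers unfamiliar with the truncated importance sampling ratios the abstract refers to, the following is a minimal sketch of a standard Retrace-style TD update with linear value features. It is not the paper's modified variant (TD-MRetrace), whose exact form is not given here; the function name, argument layout, and default hyperparameters are illustrative assumptions.

```python
import numpy as np

def retrace_td_update(w, phis, rewards, pi_probs, mu_probs,
                      gamma=0.99, lam=1.0, alpha=0.1):
    """One backward pass of a standard Retrace-style TD update
    (an illustrative sketch, not the paper's TD-MRetrace).

    w        : weight vector of the linear value estimate V(s) = w @ phi(s)
    phis     : feature vectors phi(s_0), ..., phi(s_T)
    rewards  : rewards r_1, ..., r_T
    pi_probs : target-policy probabilities pi(a_t | s_t)
    mu_probs : behavior-policy probabilities mu(a_t | s_t)
    """
    T = len(rewards)
    # TD errors under the current weights.
    deltas = [rewards[t] + gamma * w @ phis[t + 1] - w @ phis[t]
              for t in range(T)]
    g = 0.0       # accumulated corrected TD error
    c_next = 0.0  # no trace beyond the horizon
    dw = np.zeros_like(w)
    for t in reversed(range(T)):
        # Retrace accumulates future TD errors through the truncated
        # ratio c_t = lam * min(1, pi/mu), which bounds the variance.
        g = deltas[t] + gamma * c_next * g
        dw += g * phis[t]
        c_next = lam * min(1.0, pi_probs[t] / mu_probs[t])
    return w + alpha * dw
```

With tabular (one-hot) features this recovers the convergent tabular Retrace setting the abstract mentions; with general linear features it is the unstable case that motivates the paper's correction.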
Other Supplementary Material: zip