Keywords: off-policy policy evaluation, average-reward
Abstract: Off-policy policy evaluation has been a longstanding problem in reinforcement learning. This paper looks at this problem under the average-reward formulation with function approximation. Differential temporal-difference (TD) learning has been proposed recently and has shown great potential compared to previous average-reward learning algorithms. In the tabular setting, off-policy differential TD is guaranteed to converge. However, the convergence guarantee cannot be carried through the function approximation setting. To address the instability of off-policy differential TD, we investigate the emphatic approach proposed for the discounted formulation. Specifically, we introduce average emphatic trace for average-reward off-policy learning. We further show that without any variance reduction techniques, the new trace suffers from slow learning due to high variance of importance sampling ratios. Finally, we show that differential emphatic TD($\beta$), extended from the discounted setting, can save us from the high variance while introducing bias. Experimental results on a counterexample show that differential emphatic TD($\beta$) performs better than an existing competitive off-policy algorithm.
TL;DR: Investigates the emphatic approach for average-reward off-policy policy evaluation with function approximation.
0 Replies
Loading