TL;DR: We present a novel approach to the off-policy estimation problem in infinite-horizon RL.
Abstract: Off-policy estimation for long-horizon problems is important in many real-life applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \citet{liu18breaking} proposed an approach that avoids the curse of horizon suffered by typical importance-sampling-based methods. While showing promising results, this approach is limited in practice, as it requires the data to be collected by a known behavior policy. In this work, we propose a novel approach that eliminates this limitation. In particular, we formulate the problem as solving for the fixed point of a "backward flow" operator and show that the fixed-point solution gives the desired importance ratios of stationary distributions between the target and behavior policies. We analyze its asymptotic consistency and finite-sample generalization. Experiments on benchmarks verify the effectiveness of our proposed approach.
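To make the formulation concrete, below is a minimal sketch (not the paper's exact construction) of the stationarity condition that underlies the backward-flow view, together with the standard ratio-weighted estimator such ratios support; the symbols $d^{\pi}$, $d^{\mu}$, and $w$ are our own notation for the target and behavior stationary distributions and their ratio.

% A sketch under assumed notation: d^pi is the stationary distribution induced by
% the target policy pi, d^mu that of the (unknown) behavior policy mu, and
% w = d^pi / d^mu is the importance ratio the method solves for.
\begin{align}
  % Stationarity ("backward flow"): the target distribution is invariant under one
  % step of the transition kernel composed with the target policy.
  d^{\pi}(s') &= \sum_{s,a} d^{\pi}(s)\,\pi(a \mid s)\,P(s' \mid s, a), \\
  % Given w(s,a) = d^{\pi}(s,a) / d^{\mu}(s,a), the long-run average reward of pi
  % is estimated from behavior data by reweighting the observed rewards.
  \hat{R}(\pi) &= \frac{\sum_{i=1}^{n} w(s_i, a_i)\, r_i}{\sum_{i=1}^{n} w(s_i, a_i)}.
\end{align}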
Keywords: reinforcement learning, off-policy estimation, importance sampling, propensity score