Keywords: Reinforcement Learning, Offline Policy Evaluation, OPE
TL;DR: We prove error bounds for an Offline Policy Evaluation (OPE) estimator when the data it is computed on may be neither independent nor identically distributed.
Abstract: Offline reinforcement learning (RL) is an important step towards making data-hungry RL algorithms more widely usable in the real world, but conventional assumptions on the distribution of logging data do not hold in some key real-world scenarios. In particular, it is unrealistic to assume that RL practitioners will have access to sets of trajectories that are both mutually independent and explore well. We propose two natural ways to relax these assumptions: allowing trajectories to be generated independently under different logging policies, and allowing logging policies to depend on past trajectories. We discuss Offline Policy Evaluation (OPE) in these settings and analyze the performance of a model-based OPE estimator when the MDP is tabular.