Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data

Sunil Madhow; Dan Qiao; Yu-Xiang Wang

05 Oct 2022 (modified: 05 May 2023), Offline RL Workshop NeurIPS 2022, Readers: Everyone

Keywords: Reinforcement Learning, Offline Policy Evaluation, OPE

TL;DR: We prove bounds on the error of an Offline Policy Evaluation estimator when the data it is computed on are not independent and identically distributed.

Abstract: Offline RL is an important step towards making data-hungry RL algorithms more widely usable in the real world, but conventional assumptions on the distribution of logging data do not apply in some key real-world scenarios. In particular, it is unrealistic to assume that RL practitioners will have access to sets of trajectories that are both mutually independent and well-exploring. We propose two natural ways to relax these assumptions: allowing the trajectories to be drawn independently from different logging policies, and allowing logging policies to depend on past trajectories. We discuss Offline Policy Evaluation (OPE) in these settings, analyzing the performance of a model-based OPE estimator when the MDP is tabular.
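
To make the phrase "model-based OPE estimator" concrete, here is a minimal sketch of one common plug-in construction for a tabular, finite-horizon MDP: estimate rewards and transitions from visit counts, then evaluate the target policy in the estimated model by backward induction. This is an illustration under stated assumptions, not necessarily the exact estimator analyzed in the paper. The function name `model_based_ope`, the trajectory format, the time-dependent model, and the choice to set estimates for unvisited state-action pairs to zero are all assumptions made for this example.

```python
import numpy as np

def model_based_ope(trajectories, pi, n_states, n_actions, horizon):
    """Plug-in value estimate for target policy `pi` (hypothetical helper).

    trajectories: list of trajectories; each is a list of length `horizon`
                  of (state, action, reward, next_state) tuples.
    pi:           array of shape (horizon, n_states, n_actions) holding the
                  target policy's action probabilities at each step.
    """
    counts = np.zeros((horizon, n_states, n_actions, n_states))
    reward_sum = np.zeros((horizon, n_states, n_actions))
    init_counts = np.zeros(n_states)

    # Accumulate visit counts; only counts are used, so the computation
    # does not depend on how the logging policies were chosen.
    for traj in trajectories:
        init_counts[traj[0][0]] += 1
        for h, (s, a, r, s_next) in enumerate(traj):
            counts[h, s, a, s_next] += 1
            reward_sum[h, s, a] += r

    n_sa = counts.sum(axis=-1)          # visits to each (h, s, a)
    visited = n_sa > 0

    # Empirical model; unvisited (h, s, a) entries stay at zero (assumption).
    P_hat = np.zeros_like(counts)
    P_hat[visited] = counts[visited] / n_sa[visited][:, None]
    r_hat = np.zeros_like(reward_sum)
    r_hat[visited] = reward_sum[visited] / n_sa[visited]
    d0_hat = init_counts / init_counts.sum()

    # Backward induction for the target policy in the estimated MDP.
    V = np.zeros(n_states)
    for h in reversed(range(horizon)):
        Q = r_hat[h] + P_hat[h] @ V     # shape (n_states, n_actions)
        V = (pi[h] * Q).sum(axis=1)
    return float(d0_hat @ V)

# Tiny deterministic example: in state 0, action 0 pays 1 and stays in
# state 0; action 1 pays 0 and moves to the absorbing state 1.
logged = [
    [(0, 0, 1.0, 0), (0, 0, 1.0, 0)],
    [(0, 1, 0.0, 1), (1, 0, 0.0, 1)],
]
always_a0 = np.zeros((2, 2, 2))
always_a0[:, :, 0] = 1.0
estimate = model_based_ope(logged, always_a0, n_states=2, n_actions=2, horizon=2)
```

In the relaxed settings described in the abstract, the trajectories passed to such a function may come from different logging policies, or from policies that depend on earlier trajectories. The sketch is computed the same way in either case; what changes is the statistical analysis of its error.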
