Keywords: Partial Observability, Model Misspecification, Off-Policy Evaluation
Abstract: Models in reinforcement learning are often estimated from offline data, which in many real-world scenarios is subject to partial observability.
In this work, we study the challenges that emerge from using models estimated from partially-observable offline data for policy evaluation.
Notably, under partial observability a complete definition of such models includes a dependence on the data-collecting policy.
To address this issue, we introduce a method for model estimation that incorporates importance weighting into the model learning process.
The off-policy samples are reweighted to reflect their probabilities under the evaluation policy, so that the resulting model is a consistent estimator of the off-policy model and yields consistent estimates of the expected off-policy return.
This is a crucial step towards the reliable and responsible use of models learned under partial observability, particularly in scenarios where inaccurate policy evaluation can have catastrophic consequences.
We empirically demonstrate the efficacy of our method, and its resilience to common approximations such as weight clipping, on a range of domains with diverse types of partial observability.
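To make the idea concrete, below is a minimal illustrative sketch of importance-weighted model estimation in a tabular setting, assuming callable behavior and evaluation policies `pi_b` and `pi_e` and an optional weight-clipping threshold; all names and the tabular formulation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def importance_weights(pi_e, pi_b, obs, acts, clip=None):
    """Per-sample weights w_i = pi_e(a_i | o_i) / pi_b(a_i | o_i).

    pi_e, pi_b: callables mapping an observation to a vector of action probabilities
    (hypothetical interfaces for this sketch).
    clip: optional upper bound on the weights (a common approximation).
    """
    w = np.array([pi_e(o)[a] / pi_b(o)[a] for o, a in zip(obs, acts)])
    if clip is not None:
        w = np.minimum(w, clip)
    return w

def fit_weighted_tabular_model(obs, acts, next_obs, weights, n_obs, n_acts):
    """Weighted empirical transition model P_hat(o' | o, a).

    Each transition contributes its importance weight instead of a unit count,
    so the estimated model reflects the evaluation policy's data distribution.
    """
    counts = np.zeros((n_obs, n_acts, n_obs))
    for o, a, o2, w in zip(obs, acts, next_obs, weights):
        counts[o, a, o2] += w
    totals = counts.sum(axis=-1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```

The estimated model can then be rolled out or solved as usual to estimate the expected return of the evaluation policy.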
Submission Number: 51