Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State AbstractionDownload PDF

05 Oct 2022 (modified: 05 May 2023)Offline RL Workshop NeurIPS 2022Readers: Everyone
Keywords: reinforcement learning, off-policy, offline RL, abstraction, representation learning, importance sampling, density ratios
TL;DR: We propose to project high-dimensional state-spaces into lower-dimensional state-spaces using state abstraction to improve the accuracy of marginalized importance sampling off-policy evaluation (OPE) algorithms on high-dimensional state OPE tasks.
Abstract: We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, $\pi_e$, using a fixed dataset, $\mathcal{D}$, collected by one or more policies that may be different from $\pi_e$. Current OPE algorithms may produce poor OPE estimates under policy distribution shift i.e., when the probability of a particular state-action pair occurring under $\pi_e$ is very different from the probability of that same pair occurring in $\mathcal{D}$ (Voloshin et al. 2021, Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimation by projecting the ground state-space into a lower-dimensional state-space using concepts from the state abstraction literature in RL. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms which compute distribution correction ratios to produce their OPE estimate. In the original state-space, these ratios may have high variance which may lead to high variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance resulting in lower variance OPE. We then present a minimax optimization problem that incorporates the state abstraction. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and more robust to hyperparameter tuning than the ground ratios.
2 Replies