Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Published: 21 Sept 2023, Last Modified: 14 Nov 2023, NeurIPS 2023 spotlight
Keywords: Reinforcement learning theory, POMDP, PAC RL, Off-policy evaluation, Offline reinforcement learning
TL;DR: Model-free off-policy evaluation in POMDPs without a curse of horizon
Abstract: We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play a role similar to that of classical value functions in fully observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We establish a PAC result, which implies that our OPE estimator is close to the true policy value as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope
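To make the minimax idea in the abstract concrete, below is a minimal illustrative sketch, not the authors' released code, of estimating a future-dependent value function from a conditional moment restriction of the form E[rho * (R + gamma * V(F')) - V(F) | H] = 0, under the simplifying assumption that both the value class over future proxies and the test-function class over history proxies are linear (in which case the minimax problem has a closed-form solution). The feature names (phi_f, phi_f_next, psi_h), the per-step importance weight rho, and the exact moment form are assumptions chosen for illustration, not the paper's precise formulation.

```python
import numpy as np

def fit_future_dependent_value(phi_f, phi_f_next, psi_h, rho, r, gamma, reg=1e-3):
    """Illustrative closed-form minimax fit with linear function classes.

    Model (assumed for this sketch): V(f) = phi(f) @ w, and the moment
        E[ psi(H) * ( rho * (R + gamma * V(F')) - V(F) ) ] = 0
    is enforced against linear test functions of the history proxy H.
    With an L2 penalty on the test function, the inner maximization reduces
    to a GMM-style quadratic objective in w.

    phi_f      : (n, d_f) features of the future proxy F
    phi_f_next : (n, d_f) features of the next-step future proxy F'
    psi_h      : (n, d_h) features of the history proxy H (instrument)
    rho        : (n,) per-step importance weights (e.g. pi_e(A|O) / pi_b(A|O))
    r          : (n,) rewards, gamma: discount factor
    """
    n = psi_h.shape[0]
    # Empirical moment matrix: E_n[ psi(H) (rho * gamma * phi(F') - phi(F))^T ]
    M = psi_h.T @ (rho[:, None] * gamma * phi_f_next - phi_f) / n
    # Empirical moment offset: -E_n[ psi(H) * rho * R ]
    b = -psi_h.T @ (rho * r) / n
    # Solve min_w ||M w - b||^2 + reg * ||w||^2 (identity weighting for simplicity)
    A = M.T @ M + reg * np.eye(phi_f.shape[1])
    w = np.linalg.solve(A, M.T @ b)
    return w

# Usage sketch: the policy value would then be estimated by averaging the fitted
# value over future proxies drawn at the initial distribution, e.g.
#   v_hat = np.mean(phi_f0 @ w)
# where phi_f0 are initial-step future-proxy features (again, an assumption).
```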
Supplementary Material: zip
Submission Number: 12388