Wei, Z., Miao, R. and Qu, A. (2026) “Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random.”

Annie Qu

Published: 30 Apr 2026; Last Modified: 07 May 2026; Venue: ICML; Readers: Everyone; License: arXiv.org perpetual, non-exclusive license

Abstract: In offline reinforcement learning, immediate rewards in logged batch data are often unobserved because of sparse or irregular record-keeping, or are censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a minimax procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through simulations the strong performance of our method compared to existing benchmarks.
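The selection bias mentioned in the abstract can be illustrated with a small exact calculation. The sketch below is not the paper's estimator. It assumes a single fixed state-action pair, a binary reward, and hypothetical observation probabilities (the names `reward_probs`, `mnar_observe_prob`, and `mar_observe_prob` are introduced here for illustration only). It compares the full-data mean reward with the mean over observed rewards when the probability of observing a reward depends on the reward itself.

```python
def full_data_mean(reward_probs):
    """Mean reward under the full-data distribution {reward: probability}."""
    return sum(r * p for r, p in reward_probs.items())


def complete_case_mean(reward_probs, observe_prob):
    """Mean reward among observed cases only.

    observe_prob[r] is the (hypothetical) probability that a reward equal
    to r is recorded. Computes E[R | observed] exactly via Bayes' rule.
    """
    numerator = sum(r * p * observe_prob[r] for r, p in reward_probs.items())
    denominator = sum(p * observe_prob[r] for r, p in reward_probs.items())
    return numerator / denominator


# Hypothetical reward distribution at one fixed (state, action) pair.
reward_probs = {0.0: 0.5, 1.0: 0.5}

# MNAR: high rewards are recorded more often than low rewards.
mnar_observe_prob = {0.0: 0.3, 1.0: 0.9}

# Missingness that does not depend on the reward, for comparison.
mar_observe_prob = {0.0: 0.6, 1.0: 0.6}

true_mean = full_data_mean(reward_probs)
mnar_mean = complete_case_mean(reward_probs, mnar_observe_prob)
mar_mean = complete_case_mean(reward_probs, mar_observe_prob)

print("full-data mean:", true_mean)
print("complete-case mean under MNAR:", mnar_mean)
print("complete-case mean, reward-independent missingness:", mar_mean)
```

Under these assumed numbers the full-data mean is 0.5, while the complete-case mean under MNAR is 0.45 / 0.6 = 0.75, even though the state and action are held fixed. When missingness does not depend on the reward, the complete-case mean matches the full-data mean, which is why ignorability-based corrections are not enough in the MNAR setting.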