Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation
TL;DR: Demystifying the Paradox of IS with an Estimated History-Dependent Behavior Policy in OPE
Abstract: This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of *why* the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
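To make the setting concrete, here is a minimal sketch, not code from the paper, of the ordinary IS estimator with a behavior policy estimated from the logged data itself by tabular frequencies conditioned on the last k state-action pairs. The toy MDP, reward, and frequency-based estimator are illustrative assumptions only; setting k = 0 yields the Markov (state-only) estimate, while larger k corresponds to the history-dependent estimation studied in the paper.

```python
# Minimal, hypothetical sketch of ordinary IS with an estimated,
# history-dependent behavior policy (illustration only, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, n_traj = 3, 2, 4, 5000

# True behavior and target policies are Markovian: pi[s, a] = P(a | s).
pi_b = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_e = rng.dirichlet(np.ones(n_actions), size=n_states)

def rollout(policy):
    """Generate one trajectory (s_t, a_t, r_t) in a toy tabular MDP."""
    s, traj = rng.integers(n_states), []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy[s])
        r = float(s == a)                 # arbitrary reward, illustration only
        traj.append((int(s), int(a), r))
        s = rng.integers(n_states)        # uniform transitions, for simplicity
    return traj

data = [rollout(pi_b) for _ in range(n_traj)]

def history_key(traj, t, k):
    """Key = last k (state, action) pairs before time t, plus the current state."""
    hist = tuple((ps, pa) for ps, pa, _ in traj[max(0, t - k):t])
    return (hist, traj[t][0])

def estimate_behavior_policy(data, k):
    """Tabular frequency estimate of the behavior policy given a length-k history."""
    counts = {}
    for traj in data:
        for t, (_, a, _) in enumerate(traj):
            counts.setdefault(history_key(traj, t, k), np.zeros(n_actions))[a] += 1
    return {key: c / c.sum() for key, c in counts.items()}

def ordinary_is(data, pi_b_hat, k):
    """Ordinary (trajectory-wise) IS estimate of the target policy's value."""
    estimates = []
    for traj in data:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e[s, a] / pi_b_hat[history_key(traj, t, k)][a]
            ret += r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

for k in (0, 1, 2):   # k = 0 recovers the Markov (state-only) estimate
    pi_b_hat = estimate_behavior_policy(data, k)
    print(f"history length {k}: IS estimate = {ordinary_is(data, pi_b_hat, k):.3f}")
```

In the paper's analysis, conditioning the estimated behavior policy on a longer history reduces the asymptotic variance of such estimators while introducing additional finite-sample bias.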
Lay Summary: Accurately evaluating reinforcement learning policies is often required in A/B testing, which modern technology companies such as Amazon, eBay, Facebook, Google, Microsoft, and Uber frequently use to compare new products or strategies against existing ones. A common method is importance sampling, which relies on estimating the treatment assignment mechanism (the behavior policy) from historical data. Interestingly, prior studies observed that estimating this policy as a function of more of the trajectory history leads to lower evaluation error, but why this occurs was unclear.
In this paper, we mathematically explain this phenomenon by decomposing the evaluation error into bias and variance components. Our analysis reveals that conditioning on history slightly increases the bias in small samples but reduces the variance asymptotically.
We also extend our theoretical findings to several widely used policy evaluation methods, providing theoretical insights for applying our theory in practice.
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Reinforcement Learning; Off-policy Evaluation
Submission Number: 11029