Keywords: off-policy evaluation, inverse propensity scoring, multiple loggers, variance reduction
Abstract: We study off-policy evaluation (OPE) in contextual bandits with data collected from multiple logging policies. As highlighted by Agarwal et al. (2017), no single inverse propensity scoring (IPS) estimator appears to consistently outperform the others in this setting. We resolve this dilemma by deriving an optimal IPS estimator with sample-dependent weights that minimize variance. Through a calculus-of-variations approach, we obtain closed-form optimal weights under the unbiasedness constraint, yielding an unbiased estimator with asymptotically optimal variance. Experiments on four benchmark datasets confirm this resolution in practice: our estimator consistently outperforms state-of-the-art methods, with substantial relative RMSE reductions across diverse logger mixtures and numbers of logging policies.
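For concreteness, here is a minimal sketch of the weighted multi-logger IPS form the abstract alludes to; the notation is assumed for illustration and is not taken from the submission itself. Suppose loggers $k = 1, \dots, K$ collect $n_k$ of the $n$ logged samples $D_k = \{(x, a, r)\}$ under propensities $\mu_k(a \mid x)$, and $\pi$ is the target policy. A weight function $\lambda_k(x, a)$ defines the estimator
$$
\hat{V}_\lambda \;=\; \frac{1}{n} \sum_{k=1}^{K} \sum_{(x, a, r) \in D_k} \lambda_k(x, a)\, \frac{\pi(a \mid x)}{\mu_k(a \mid x)}\, r,
$$
which is unbiased for the value of $\pi$ whenever $\sum_{k=1}^{K} \frac{n_k}{n}\, \lambda_k(x, a) = 1$ for every context-action pair with positive support. The submission's contribution is a closed-form choice of such weights that minimizes variance subject to this unbiasedness constraint.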
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 4592