Keywords: off-policy evaluation, inverse propensity scoring, multiple loggers, variance reduction
Abstract: We study off-policy evaluation (OPE) in contextual bandits with data collected from multiple logging policies. As highlighted by Agarwal et al. (2017), no single inverse propensity scoring (IPS) estimator appears to consistently outperform the others in this setting. We resolve this dilemma by deriving an optimal IPS estimator with sample-dependent weights that minimize variance. Through a calculus-of-variations approach, we obtain closed-form optimal weights under the unbiasedness constraint, yielding an unbiased estimator with asymptotically optimal variance. Experiments on four benchmark datasets confirm this resolution in practice: our estimator consistently outperforms state-of-the-art methods, with substantial relative RMSE reductions across diverse logger mixtures and numbers of logging policies.
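For concreteness, here is a minimal sketch of the weighted multi-logger IPS form the abstract alludes to; the notation is assumed for illustration and is not taken from the submission itself. Suppose loggers $k = 1, \dots, K$ collect $n_k$ of the $n$ logged samples $D_k = \{(x, a, r)\}$ under propensities $\mu_k(a \mid x)$, and $\pi$ is the target policy. A weight function $\lambda_k(x, a)$ defines the estimator
$$
\hat{V}_\lambda \;=\; \frac{1}{n} \sum_{k=1}^{K} \sum_{(x, a, r) \in D_k} \lambda_k(x, a)\, \frac{\pi(a \mid x)}{\mu_k(a \mid x)}\, r,
$$
which is unbiased for the value of $\pi$ whenever $\sum_{k=1}^{K} \frac{n_k}{n}\, \lambda_k(x, a) = 1$ for every context-action pair with positive support. The submission's contribution is a closed-form choice of such weights that minimizes variance subject to this unbiasedness constraint.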
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 4592