Off-policy Bandits with Deficient Support

Noveen Sachdeva; Yi Su; Thorsten Joachims

25 Sept 2019 (modified: 26 May 2025) | ICLR 2020 Conference Blind Submission | Readers: Everyone

Abstract: Off-policy training of contextual-bandit policies is attractive in online systems (e.g., search, recommendation, ad placement), since it enables the reuse of large amounts of log data from the production system. State-of-the-art methods for off-policy learning, however, are based on inverse propensity score (IPS) weighting, which requires that the logging policy chooses all actions with non-zero probability for any context (i.e., full support). In real-world systems, this condition is often violated, and we show that existing off-policy learning methods based on IPS weighting can fail catastrophically. We therefore develop new off-policy contextual-bandit methods that can controllably and robustly learn even when the logging policy has deficient support. To this end, we explore three approaches that provide various guarantees for safe learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We analyze the statistical and computational properties of these three approaches, and empirically evaluate their effectiveness in a series of experiments. We find that restricting the policy space is computationally efficient and robustly leads to accurate policies.
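To illustrate the full-support requirement mentioned in the abstract, the following is a minimal sketch of a plain IPS estimator, not the authors' implementation. The three-action, context-free setup, the reward values, both logging policies, and the names `ips_estimate` and `simulate` are assumptions chosen purely for illustration. When the logging policy never plays an action that the target policy favors, the IPS estimate silently drops that action's contribution.

```python
import numpy as np


def ips_estimate(actions, rewards, target_probs, logging_probs):
    """Plain IPS estimate of the target policy's expected reward.

    actions, rewards: logged (action, reward) pairs.
    target_probs, logging_probs: per-action probabilities (context-free toy case).
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    # Importance weight pi(a) / pi_0(a) for each logged action.
    weights = target_probs[actions] / logging_probs[actions]
    return float(np.mean(weights * rewards))


# Hypothetical toy problem: 3 actions with fixed (noise-free) rewards.
true_reward = np.array([0.2, 0.5, 0.9])
target = np.array([0.1, 0.1, 0.8])        # policy we want to evaluate
full_support = np.array([0.4, 0.4, 0.2])  # logs every action
deficient = np.array([0.5, 0.5, 0.0])     # never logs action 2

rng = np.random.default_rng(0)
n = 200_000


def simulate(logging_probs):
    """Draw n logged actions from the logging policy and return the IPS estimate."""
    actions = rng.choice(3, size=n, p=logging_probs)
    rewards = true_reward[actions]
    return ips_estimate(actions, rewards, target, logging_probs)


true_value = float(target @ true_reward)  # 0.79
est_full = simulate(full_support)         # close to 0.79
est_deficient = simulate(deficient)       # close to 0.07: action 2's share is lost

print(f"true value:        {true_value:.3f}")
print(f"IPS, full support: {est_full:.3f}")
print(f"IPS, deficient:    {est_deficient:.3f}")
```

In this toy case the deficient-support estimate converges to the reward mass of the supported actions only (0.1 × 0.2 + 0.1 × 0.5 = 0.07), so collecting more log data does not remove the bias.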

Keywords: Recommender System, Search Engine, Counterfactual Learning

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/off-policy-bandits-with-deficient-support/code)
