Abstract: We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we derive generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. Using the CRM principle, we derive a new learning algorithm -- Policy Optimizer for Exponential Models (POEM) -- for structured output prediction. We evaluate POEM on several multi-label classification problems and verify that its empirical performance supports the theory.
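As a rough illustration of the ideas named in the abstract, the sketch below computes a propensity-weighted empirical risk from logged data and adds a penalty based on the sample variance of the weighted losses. The abstract does not state the exact objective, so the penalty form (`lam` times the square root of the sample variance divided by n), the optional weight clipping, and all function and parameter names are assumptions made for illustration, not the authors' POEM implementation.

```python
import math
from statistics import mean, variance


def weighted_losses(losses, new_probs, logged_probs, clip=None):
    """Reweight each logged loss by new_prob / logged_prob (the propensity ratio).

    losses:       observed bandit losses for the logged predictions
    new_probs:    probability the candidate policy assigns to each logged prediction
    logged_probs: probability (propensity) the logging policy assigned to it
    clip:         optional cap on the ratio (an assumption, used to limit variance)
    """
    out = []
    for loss, q, p in zip(losses, new_probs, logged_probs):
        w = q / p
        if clip is not None:
            w = min(w, clip)
        out.append(loss * w)
    return out


def ips_risk(losses, new_probs, logged_probs, clip=None):
    """Propensity-weighted empirical risk: the mean of the reweighted losses."""
    return mean(weighted_losses(losses, new_probs, logged_probs, clip))


def variance_penalized_risk(losses, new_probs, logged_probs, lam=1.0, clip=None):
    """Empirical risk plus a variance penalty (hypothetical form of the objective).

    Needs at least two logged samples, because the sample variance is used.
    """
    u = weighted_losses(losses, new_probs, logged_probs, clip)
    return mean(u) + lam * math.sqrt(variance(u) / len(u))


if __name__ == "__main__":
    # Toy logged data, invented purely for demonstration.
    losses = [1.0, 0.0, 1.0, 0.0]
    new_probs = [0.5, 0.5, 0.5, 0.5]
    logged_probs = [0.25, 0.5, 0.5, 0.5]
    print(ips_risk(losses, new_probs, logged_probs))
    print(variance_penalized_risk(losses, new_probs, logged_probs, lam=1.0))
```

Under these assumptions, a candidate policy whose reweighted losses vary widely is scored worse than one with the same mean but lower variance, which is the intuition behind accounting for the estimator's variance in the bound.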