Multi-Task Off-Policy Learning from Bandit Feedback

Published: 24 Apr 2023, Last Modified: 15 Jun 2023ICML 2023 PosterEveryoneRevisions
Abstract: Many practical problems involve solving similar tasks. In recommender systems, the tasks can be users with similar preferences; in search engines, the tasks can be items with similar affinities. To learn statistically efficiently, the tasks can be organized in a hierarchy, where the task affinity is captured using an unknown latent parameter. We study the problem of off-policy learning for similar tasks from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm HierOPO. The key idea is to estimate the task parameters using the hierarchy and then act pessimistically with respect to them. To analyze the algorithm, we develop novel Bayesian error bounds. Our bounds are the first in off-policy learning that improve with a more informative prior and capture statistical gains due to hierarchical models. Therefore, they are of a general interest. HierOPO also performs well in practice. Our experiments demonstrate the benefits of using the hierarchy over solving each task independently.
Submission Number: 5034