Keywords: off-policy learning, recommender systems, retrieval augmented generation, candidate retrieval, importance sampling, diversity
TL;DR: This paper studies an off-policy learning approach for candidate retrieval in two-stage decision systems.
Abstract: Two-stage decision systems, which first retrieve a candidate set of items (e.g., fashion items or documents) and then generate the output (e.g., a ranked list of featured items or articles) from the candidates, are widely adopted in real-world applications, including search, recommendation, and retrieval-augmented generation (RAG). Diversity in the candidate set is crucial in applications such as news recommendation and opinion summarization. However, conventional approaches to candidate retrieval fail to incorporate diversity without post-processing, because they model a single representation of user preference and ignore the (multi-modal) distribution of user preferences over diverse items. To circumvent this issue, we propose a novel Off-Policy Learning (OPL) framework that (1) models the multi-modal distribution of user preferences and (2) optimizes the preference distribution and the candidate set to maximize the user engagement signal, using logged bandit feedback. Moreover, we present a Kernel Importance Sampling (Kernel IS)-based policy gradient estimator that mitigates the high variance, deficient support, and severe rejection sampling caused by the vanilla IS policy gradient, and we provide theoretical guarantees on its bias and variance.
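For intuition about the general technique the abstract names, here is a minimal sketch of a kernel-smoothed IS policy gradient on logged bandit feedback. This is not the paper's actual estimator: the linear-Gaussian target policy, the Gaussian kernel, the bandwidth `h`, and the function names (`gaussian_kernel`, `kernel_is_policy_gradient`) are all illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(u, h):
    """Gaussian kernel with bandwidth h, evaluated at distance u."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def kernel_is_policy_gradient(theta, contexts, actions, rewards,
                              logging_propensities, h=0.5, sigma=1.0):
    """Kernel-IS policy gradient estimate from logged bandit feedback.

    Replaces the vanilla IS ratio pi_theta(a|x) / pi_0(a|x) with a
    kernel-smoothed weight K_h(a - mu_theta(x)) / pi_0(a|x), trading a
    small bias for lower variance and some tolerance of deficient
    support in the logging policy. (Hypothetical linear-Gaussian
    target policy: a ~ N(x @ theta, sigma^2).)
    """
    mu = contexts @ theta                                  # policy mean per context
    # Kernel-smoothed importance weights at the logged actions
    weights = gaussian_kernel(actions - mu, h) / logging_propensities
    # Score function of the Gaussian policy: d log pi_theta(a|x) / d theta
    score = ((actions - mu) / sigma**2)[:, None] * contexts
    # Monte Carlo gradient estimate over the logged data
    return np.mean((weights * rewards)[:, None] * score, axis=0)
```

The bandwidth `h` controls the usual trade-off: larger values smooth the weights more aggressively, lowering variance and softening rejection of logged actions far from the target policy, at the cost of added bias.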
Submission Number: 3