Abstract: An increasingly important building block of large-scale machine learning systems is returning slates: ordered lists of items given a query. Applications of this technology include search, information retrieval, and recommender systems. When the action space is large, decision systems are restricted to a particular structure so that online queries complete quickly. This paper addresses the optimization of these large-scale decision systems given an arbitrary reward function. We cast this learning problem in a policy optimization framework and propose a new class of policies, born from a novel relaxation of decision functions. This results in a simple yet efficient learning algorithm that scales to massive action spaces. We compare our method to the commonly adopted Plackett-Luce policy class and demonstrate the effectiveness of our approach on problems with action space sizes on the order of millions.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: The following changes were made to answer the concerns of the reviewers:
- Changed the presentation of the LGP method by introducing it directly in the main paper; the general presentation of the method (LRP) is left to the appendix.
- Extended the experiment section in the main paper by adding, with discussion, the performance of the different algorithms per iteration (instead of under a fixed time budget).
Assigned Action Editor: ~Sebastian_Tschiatschek1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 1435