Efficient Offline Learning of Ranking Policies via Top-$k$ Policy Decomposition

Published: 19 Jun 2024 · Last Modified: 26 Jul 2024 · ARLET 2024 Poster · CC BY 4.0
Keywords: off-policy learning, policy gradient, importance weighting, ranking
TL;DR: We propose a new OPL method for ranking that combines the policy- and regression-based approaches in an effective fashion.
Abstract: We study \textit{Off-Policy Learning} (OPL) of ranking policies, which enables us to learn new ranking policies using only historical logged data. Ranking settings make OPL remarkably challenging because their action spaces, which consist of permutations of unique items, are extremely large. Existing methods, which primarily use either policy- or regression-based approaches, suffer from high variance and high bias, respectively. To circumvent these issues, we propose a new OPL method for ranking, named \textbf{\textit{Ranking Policy Optimization via Top-$k$ Policy Decomposition (R-POD)}}, which combines the policy- and regression-based approaches in an effective fashion. Specifically, R-POD decomposes a ranking policy into a first-stage policy for selecting the \textit{top-$k$} actions and a second-stage policy for choosing the bottom actions given the top-$k$ actions. It then learns the first-stage policy via the policy-based approach and the second-stage policy via the regression-based approach. In particular, we propose a new policy gradient estimator to learn the first-stage policy via the policy-based approach. This estimator can substantially reduce variance, since it applies importance weighting only to the top-$k$ actions. We also demonstrate that our policy-gradient estimator for the first-stage policy is unbiased under a \textit{conditional pairwise correctness} condition, which only requires that the expected reward differences of pairs of rankings sharing the same top-$k$ actions can be estimated correctly. Comprehensive experiments illustrate that R-POD provides substantial improvements in OPL for ranking where existing methods fail due to large action spaces.
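The abstract describes the first-stage estimator only at a high level, so the sketch below illustrates the general idea in Python: importance weights are computed from the top-$k$ prefix alone, while a regression model supplies reward estimates in a doubly-robust-style combination. All names (`pi_theta`, `pi_0`, `q_hat`, the toy data) are illustrative assumptions, not the authors' implementation; the exact estimator in the paper may differ.

```python
# Minimal sketch of a first-stage policy gradient with importance weighting
# applied only to the top-k prefix, plus a regression-based reward model as
# a control variate. Purely illustrative; not the paper's actual code.
import numpy as np

rng = np.random.default_rng(0)

n_topk_slates = 5   # candidate top-k prefixes, treated as discrete actions
d_context = 3       # context dimension
n_logged = 1000     # size of the logged dataset

# --- synthetic logged data (stand-in for real logged bandit feedback) ------
X = rng.normal(size=(n_logged, d_context))           # contexts
pi_0 = np.full(n_topk_slates, 1.0 / n_topk_slates)   # uniform logging policy over top-k prefixes
A = rng.integers(0, n_topk_slates, size=n_logged)    # logged top-k prefix index
R = rng.binomial(1, 0.3 + 0.1 * (A == 2))            # observed ranking-level reward

def q_hat(x, a):
    """Hypothetical regression-based reward estimate given context and
    top-k prefix (in R-POD this would come from the second-stage model)."""
    return 0.3 + 0.1 * (a == 2)

# --- first-stage policy: linear softmax over top-k prefixes ----------------
theta = np.zeros((d_context, n_topk_slates))

def pi_theta(x):
    logits = x @ theta
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(x, a):
    # d log pi(a|x) / d theta[:, j] = x * (1[j == a] - pi(j|x))
    p = pi_theta(x)
    g = -np.outer(x, p)
    g[:, a] += x
    return g

def first_stage_gradient(X, A, R):
    """Policy-gradient estimate with importance weights on the top-k prefix only."""
    grad = np.zeros_like(theta)
    for x, a, r in zip(X, A, R):
        p = pi_theta(x)
        w = p[a] / pi_0[a]                       # weight depends only on the top-k prefix
        # correction term: weighted residual against the regression estimate
        grad += w * (r - q_hat(x, a)) * grad_log_pi(x, a)
        # direct term: regression estimates averaged under the current policy
        for a_prime in range(n_topk_slates):
            grad += p[a_prime] * q_hat(x, a_prime) * grad_log_pi(x, a_prime)
    return grad / len(X)

theta += 0.1 * first_stage_gradient(X, A, R)     # one gradient-ascent step
```

Because the importance weight is a ratio of top-$k$ prefix probabilities rather than full-ranking probabilities, its range is far smaller, which is the variance-reduction mechanism the abstract points to.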
Submission Number: 13