RewardRank: Optimizing True Learning-to-Rank Utility

ICLR 2026 Conference Submission16042 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: learning-to-rank, reward model, ranking utility, Plackett-Luce, continuous relaxations
TL;DR: Learning-To-Rank to Maximize Counterfactual Utility
Abstract: Traditional ranking systems optimize offline proxy objectives that assume simplistic user behaviors, overlooking factors such as position bias and item diversity. As a result, they fail to improve the counterfactual ranking utilities that matter in practice, such as click-through rate or purchase probability measured in online A/B tests. We introduce RewardRank, a data-driven learning-to-rank (LTR) framework for counterfactual utility maximization. RewardRank first trains a reward model to predict the utility of any ranking from logged data, then optimizes a ranker to maximize this reward via a differentiable soft permutation operator. To address the absence of large-scale, reproducible counterfactual LTR benchmarks, we propose two evaluation suites: (i) Parametric Oracle Evaluation (PO-Eval), which uses an open-source click model as the counterfactual oracle on the Baidu-ULTR dataset, and (ii) LLM-As-User Evaluation (LAU-Eval), which simulates realistic user interactions using a large language model on the Amazon-KDD-Cup dataset. RewardRank achieves the highest counterfactual utility on both suites and sets a new state of the art in relevance performance on Baidu-ULTR with real-click signals, demonstrating the feasibility of directly optimizing ranking policies for counterfactual utility. Our code is available at: https://anonymous.4open.science/r/RewardRank-EE46.
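To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch sketch of the idea: a frozen reward model scores a (softly) reordered list, a ranker produces item scores, and a NeuralSort-style continuous relaxation of the permutation matrix makes the ordering differentiable so the ranker can be trained to maximize predicted utility. All names (`RewardModel`, `Ranker`, `neuralsort`, feature sizes, the temperature `tau`) are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

def neuralsort(scores, tau=1.0):
    """NeuralSort-style relaxation: scores (B, n) -> soft permutation
    matrices (B, n, n); rows approach one-hot vectors as tau -> 0."""
    n = scores.size(-1)
    s = scores.unsqueeze(-1)                                         # (B, n, 1)
    pairwise = (s - s.transpose(1, 2)).abs().sum(-1, keepdim=True)   # (B, n, 1)
    ranks = torch.arange(1, n + 1, dtype=scores.dtype, device=scores.device)
    logits = ((n + 1 - 2 * ranks).view(1, n, 1) * s.transpose(1, 2)
              - pairwise.transpose(1, 2)) / tau                      # (B, n, n)
    return logits.softmax(dim=-1)

class RewardModel(nn.Module):
    """Predicts counterfactual utility (e.g., click/purchase probability)
    of an ordered item list; assumed pre-trained on logged rankings."""
    def __init__(self, d_feat, n_items):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(d_feat * n_items, 128), nn.ReLU(),
            nn.Linear(128, 1))
    def forward(self, ordered_items):                # (B, n, d)
        return self.net(ordered_items).squeeze(-1)   # (B,)

class Ranker(nn.Module):
    """Scores each item; the soft permutation turns scores into an ordering."""
    def __init__(self, d_feat):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_feat, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, items):                        # (B, n, d)
        return self.net(items).squeeze(-1)           # (B, n)

# Toy data and a frozen, pre-trained reward model (stand-in).
B, n, d = 32, 10, 16
items = torch.randn(B, n, d)
reward_model = RewardModel(d, n)
for p in reward_model.parameters():
    p.requires_grad_(False)

ranker = Ranker(d)
opt = torch.optim.Adam(ranker.parameters(), lr=1e-3)
for step in range(100):
    scores = ranker(items)                           # (B, n)
    P_soft = neuralsort(scores, tau=0.5)             # (B, n, n)
    reordered = P_soft @ items                       # softly re-ranked features
    loss = -reward_model(reordered).mean()           # maximize predicted utility
    opt.zero_grad(); loss.backward(); opt.step()
```

At evaluation time one would replace the soft permutation with a hard argsort of the ranker's scores; the relaxation is only needed to propagate gradients from the reward model back into the ranker during training.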
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16042