Bipartite Ranking From Multiple Labels: On Loss Versus Label Aggregation

Michal Lukasik; Lin Chen; Harikrishna Narasimhan; Aditya Krishna Menon; Wittawat Jitkrittum; Felix X. Yu; Sashank J. Reddi; Gang Fu; Mohammadhossein Bateni; Sanjiv Kumar

Published: 01 May 2025, Last Modified: 23 Jul 2025, ICML 2025 poster, CC BY-NC-ND 4.0

Abstract: Bipartite ranking is a fundamental supervised learning problem, with the goal of learning a ranking over instances with maximal area under the ROC curve (AUC) against a single binary target label. However, one may often observe multiple binary target labels, e.g., from distinct human annotators. How can one synthesize such labels into a single coherent ranking? In this work, we formally analyze two approaches to this problem—loss aggregation and label aggregation—by characterizing their Bayes-optimal solutions. We show that while both approaches can yield Pareto-optimal solutions, loss aggregation can exhibit label dictatorship: one can inadvertently (and undesirably) favor one label over others. This suggests that label aggregation can be preferable to loss aggregation, which we empirically verify.

Lay Summary: Imagine you're judging a talent show with several other judges. Each judge provides a simple "yes" or "no" vote for each contestant. The challenge is, how do you combine all these individual votes to create a single, fair ranking of all the contestants? This is a common problem in AI, such as when ranking search results using different signals like "relevance" and "user engagement." Our research investigates two common ways to solve this. One method, loss aggregation, is like weighting and adding up each judge's final scorecard. The other, label aggregation, is like first creating a combined vote for each contestant from the individual "yes/no"s, and then ranking them based on that. We discovered a critical flaw in the first method: it can lead to "label dictatorship." This means the final ranking might accidentally be dominated by a single judge's opinion, not because they are wiser, but simply because they are very picky or very generous (their votes are statistically "skewed"). Our work shows that the second method, label aggregation, avoids this problem and provides a more balanced and reliable ranking. This insight helps developers build fairer and more predictable ranking systems.
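
To make the two approaches concrete, here is a minimal Python sketch of how each one scores a candidate ranking. It is an illustration under stated assumptions, not the paper's implementation: the function names (`pairwise_auc`, `loss_aggregated_auc`, `label_aggregated_auc`), the weighted-average aggregation rule, the equal weights, and the toy data are all hypothetical. Loss aggregation is taken here to mean a weighted sum of per-label AUCs; label aggregation is taken to mean combining the labels first and then measuring pairwise agreement with the combined label.

```python
from itertools import permutations


def pairwise_auc(scores, targets):
    """Fraction of pairs with targets[i] > targets[j] that the scores order
    correctly (score ties count as 1/2). For binary targets this is the
    usual empirical AUC."""
    correct, total = 0.0, 0
    for i, j in permutations(range(len(scores)), 2):
        if targets[i] > targets[j]:
            total += 1
            if scores[i] > scores[j]:
                correct += 1.0
            elif scores[i] == scores[j]:
                correct += 0.5
    if total == 0:
        raise ValueError("need at least one pair with different targets")
    return correct / total


def loss_aggregated_auc(scores, label_sets, weights):
    """Assumed form of loss aggregation: weighted sum of per-label AUCs."""
    return sum(w * pairwise_auc(scores, labels)
               for w, labels in zip(weights, label_sets))


def label_aggregated_auc(scores, label_sets, weights):
    """Assumed form of label aggregation: combine the labels per instance
    (weighted average here), then evaluate against the combined label."""
    combined = [sum(w * labels[i] for w, labels in zip(weights, label_sets))
                for i in range(len(scores))]
    return pairwise_auc(scores, combined)


# Toy data (hypothetical): four instances, two annotators.
scores = [0.9, 0.7, 0.4, 0.1]
judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 1, 0]
weights = [0.5, 0.5]

print(loss_aggregated_auc(scores, [judge_a, judge_b], weights))   # 0.875
print(label_aggregated_auc(scores, [judge_a, judge_b], weights))  # 1.0
```

The two objectives can disagree on the same ranking, which is why the choice between them matters. This toy example only shows how each quantity is computed; it does not by itself demonstrate the label dictatorship effect analyzed in the paper.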

Primary Area: General Machine Learning->Supervised Learning

Keywords: learning to rank, area under the curve, multi-objective optimization

Submission Number: 14546
