Comparing Few to Rank Many: Active Human Preference Learning Using Randomized Frank-Wolfe Method

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 Poster, CC BY 4.0
TL;DR: We propose an efficient active learning algorithm for learning human preferences from K-way comparison of a large number of choices.
Abstract: We study learning human preferences from limited comparison feedback, a core machine learning problem at the center of reinforcement learning from human feedback (RLHF). We formulate the problem as learning a Plackett-Luce (PL) model from a limited number of $K$-subset comparisons over a universe of $N$ items, where typically $K \ll N$. Our objective is to select the $K$-subsets such that all items can be ranked with minimal mistakes within the budget. We solve the problem using the D-optimal design, which minimizes the worst-case ranking loss under the estimated PL model. All known algorithms for this problem are computationally infeasible in our setting because the number of $K$-subsets we consider is exponential in $K$. To address this challenge, we propose a randomized Frank-Wolfe algorithm with memoization and sparse updates that has a low $O(N^2 + K^2)$ per-iteration complexity. We analyze it and demonstrate its empirical superiority on synthetic and open-source NLP datasets.
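To make the approach concrete, below is a minimal, hypothetical Python sketch of a randomized Frank-Wolfe iteration for the D-optimal design under a Plackett-Luce model. All function names, parameters, and modeling simplifications here are illustrative assumptions rather than the paper's method; in particular, this sketch recomputes a matrix inverse each iteration instead of using the memoized inverse updates and sparse gradient evaluations that give the paper its low $O(N^2 + K^2)$ per-iteration complexity. See the linked repository for the actual implementation.

```python
import numpy as np


def pl_information_matrix(subset, theta, N):
    """Fisher-information contribution of one K-way comparison under a
    simplified (first-choice / multinomial-logit) Plackett-Luce model."""
    idx = np.asarray(subset)
    w = np.exp(theta[idx])
    p = w / w.sum()
    info = np.zeros((N, N))
    info[np.ix_(idx, idx)] = np.diag(p) - np.outer(p, p)
    return info


def randomized_frank_wolfe(theta, K, iters=200, num_candidates=50, reg=1e-6, seed=0):
    """Hypothetical sketch: randomized Frank-Wolfe for the D-optimal design over
    K-subsets. Instead of solving the linear-maximization oracle over all C(N, K)
    subsets, it scores a handful of randomly sampled subsets and keeps the best one."""
    rng = np.random.default_rng(seed)
    N = len(theta)
    design = {}                    # sparse design: subset tuple -> weight
    V = reg * np.eye(N)            # running information matrix of the design
    for t in range(iters):
        V_inv = np.linalg.inv(V)   # gradient of log det(V) is V^{-1}
        best_S, best_score = None, -np.inf
        for _ in range(num_candidates):
            S = tuple(sorted(rng.choice(N, size=K, replace=False).tolist()))
            # Frank-Wolfe oracle value: <grad, A_S> = trace(V^{-1} A_S)
            score = np.trace(V_inv @ pl_information_matrix(S, theta, N))
            if score > best_score:
                best_S, best_score = S, score
        gamma = 2.0 / (t + 2)      # standard Frank-Wolfe step size
        for S in design:
            design[S] *= 1.0 - gamma
        design[best_S] = design.get(best_S, 0.0) + gamma
        V = (1.0 - gamma) * V + gamma * pl_information_matrix(best_S, theta, N)
    return design
```

The sampled oracle sidesteps enumerating all $\binom{N}{K}$ subsets: each candidate is scored by the gradient of $\log\det$, which reduces to a trace against the current inverse information matrix, and the design stays sparse because at most one new subset gains weight per iteration.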
Lay Summary: Take any collection of objects, such as movies, books, or music. Suppose you want to learn a person's preferences over these objects, but you can only ask them to compare a small subset of the objects at a time. How would you do that with the minimum number of questions? We propose, analyze, and empirically evaluate a method that computes these questions.
Link To Code: https://github.com/tkkiran/DopeWolfe
Primary Area: General Machine Learning->Online Learning, Active Learning and Bandits
Keywords: active learning, human preference learning, comparison feedback, optimal design, Frank-Wolfe method
Submission Number: 15875