Provably Sample-Efficient Active Preference Data Collection

ICLR 2026 Conference Submission 19419 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Active learning, Preference feedback, Neural dueling bandit, Contextual dueling bandit
TL;DR: This paper introduces an active learning algorithm based on neural contextual dueling bandits, providing a principled and practical approach for efficiently collecting preference feedback when the latent reward function is non-linear.
Abstract: Collecting human preference feedback is often expensive, prompting recent works to develop algorithms that select preference queries more efficiently. However, these works assume that the underlying reward function is linear, an assumption that does not hold in many real-world applications, e.g., online recommendation. To address this limitation, we propose Neural-ADB, an algorithm based on the neural contextual dueling bandit framework that provides a practical method for collecting human preference feedback when the underlying latent reward function is non-linear. We theoretically show that, when preference feedback follows the Bradley-Terry-Luce model, the worst-case sub-optimality gap of the policy learned by Neural-ADB decreases at a sub-linear rate as the preference dataset grows. Our experimental results on preference datasets further corroborate the effectiveness of Neural-ADB.
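For concreteness, below is a minimal Python sketch of the two ingredients named in the abstract: the Bradley-Terry-Luce (BTL) preference model and an uncertainty-driven choice of which pair of options to query. The function names, the per-arm `reward_mean`/`reward_std` estimates, and the acquisition score are illustrative assumptions, not the paper's actual Neural-ADB rule, which is not specified on this page.

```python
import numpy as np

def btl_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry-Luce probability that option a is preferred over b,
    given latent rewards r_a and r_b (the standard logistic link)."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def select_query(reward_mean: np.ndarray, reward_std: np.ndarray):
    """Pick the pair of arms whose duel outcome is most uncertain.

    `reward_mean` and `reward_std` stand in for per-arm estimates from a
    neural reward model (e.g., a mean prediction plus some uncertainty
    proxy such as a confidence width); this acquisition score is a
    hypothetical illustration of active pair selection, not Neural-ADB.
    """
    n = len(reward_mean)
    best_pair, best_score = None, -np.inf
    for i in range(n):
        for j in range(i + 1, n):
            # Close estimated rewards and high uncertainty make the
            # duel most informative under a BTL-style feedback model.
            score = reward_std[i] + reward_std[j] - abs(reward_mean[i] - reward_mean[j])
            if score > best_score:
                best_pair, best_score = (i, j), score
    return best_pair
```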
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 19419