PB²: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, Deep Reinforcement Learning, Preference-based Reinforcement Learning
Abstract: Preference-based reinforcement learning (PbRL) has emerged as a promising approach for learning behaviors from human feedback without predefined reward functions. However, current PbRL methods face a critical challenge in effectively exploring the preference space, often converging prematurely to suboptimal policies that satisfy only a narrow subset of human preferences. In this work, we identify and address this preference exploration problem through population-based methods. We demonstrate that maintaining a diverse population of agents enables more comprehensive exploration of the preference landscape compared to single-agent approaches. Crucially, this diversity improves reward model learning by generating preference queries with clearly distinguishable behaviors, a key factor in real-world scenarios where humans must easily differentiate between options to provide meaningful feedback. Our experiments reveal that current methods may fail by getting stuck in local optima, requiring excessive feedback, or degrading significantly when human evaluators make errors on similar trajectories, a realistic scenario often overlooked by methods relying on perfect oracle teachers. Our population-based approach demonstrates robust performance when teachers mislabel similar trajectory segments and shows significantly enhanced preference exploration capabilities, particularly in environments with complex reward landscapes.
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22887