PB²: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning
Keywords: Reinforcement Learning, Deep Reinforcement Learning, Preference-based Reinforcement Learning
Abstract: Preference-based reinforcement learning (PbRL) has emerged as a promising approach for learning behaviors from human feedback without predefined reward functions. However, current PbRL methods face a critical challenge in effectively exploring the preference space, often converging prematurely to suboptimal policies that satisfy only a narrow subset of human preferences. In this work, we identify and address this preference exploration problem through population-based methods. We demonstrate that maintaining a diverse population of agents enables more comprehensive exploration of the preference landscape compared to single-agent approaches. Crucially, this diversity improves reward model learning by generating preference queries with clearly distinguishable behaviors, a key factor in real-world scenarios where humans must easily differentiate between options to provide meaningful feedback. Our experiments reveal that current methods may fail by getting stuck in local optima, requiring excessive feedback, or degrading significantly when human evaluators make errors on similar trajectories, a realistic scenario often overlooked by methods relying on perfect oracle teachers. Our population-based approach demonstrates robust performance when teachers mislabel similar trajectory segments and shows significantly enhanced preference exploration capabilities, particularly in environments with complex reward landscapes.
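The core idea of querying with clearly distinguishable behaviors can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the paper's actual algorithm: it assumes each agent in a population produces trajectory segments summarized as behavior feature vectors, and it selects the cross-agent segment pair with the largest feature distance as the next preference query (names such as `select_query_pair` and `population_segments` are invented for this example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 agents, each with 5 trajectory segments summarized
# as 8-dimensional behavior feature vectors (random stand-ins here).
population_segments = {
    agent_id: rng.normal(size=(5, 8))
    for agent_id in range(4)
}

def select_query_pair(segments_by_agent):
    """Pick the pair of segments from *different* agents whose behavior
    features are farthest apart, so a human teacher can more easily
    tell the two behaviors apart when giving a preference label."""
    best, best_dist = None, -1.0
    agents = list(segments_by_agent)
    for i, a in enumerate(agents):
        for b in agents[i + 1:]:
            for seg_a in segments_by_agent[a]:
                for seg_b in segments_by_agent[b]:
                    dist = float(np.linalg.norm(seg_a - seg_b))
                    if dist > best_dist:
                        best_dist, best = dist, (a, b, seg_a, seg_b)
    return best, best_dist
```

A single-agent method would draw both segments from one policy, which tends to yield near-duplicate behaviors; maximizing distance across a diverse population makes the query easier for a human to label reliably.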
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22887