Keywords: Preference Learning, Reinforcement Learning from Human Feedback
TL;DR: PAWS is a novel preference learning approach that leverages advantage-weighted policy updates on segments to mitigate the temporal credit assignment problem.
Abstract: Training agents that align with human intentions is a central challenge in machine learning. Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm by leveraging human feedback in the form of trajectory-level comparisons, thereby avoiding the need for explicit reward design or expert demonstrations. However, existing PbRL methods typically rely on per-step reward assignments inferred from trajectory preferences, which introduces inconsistencies and exacerbates the temporal credit assignment problem. In this work, we analyze this issue and demonstrate its adverse impact on policy learning. To address this problem, we propose Preference Learning with Advantage-weighted Segments (PAWS), a novel segment-based preference learning method that updates policies directly with segment-level advantage functions. By preserving segment-level preference information, PAWS ensures stable policy updates while avoiding misleading per-step reward signals. Empirical evaluations on a diverse set of simulated robot manipulation and locomotion tasks show that PAWS achieves higher task performance than existing PbRL approaches, highlighting the effectiveness of our method in aligning policies with human preferences.
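The abstract's core idea, replacing inferred per-step rewards with a single advantage weight per segment, can be illustrated with a generic AWR-style sketch. The function names, the exponential weighting, and the use of a value difference as the segment-level advantage estimator are all assumptions for illustration; the paper's actual estimator and update rule are not specified in the abstract.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def segment_aw_loss(logits, actions, seg_bounds, values, beta=1.0):
    """Advantage-weighted negative log-likelihood over segments (illustrative).

    All actions within a segment share one weight exp(beta * A_seg),
    where A_seg is a segment-level advantage -- here approximated as the
    value change across the segment -- instead of assigning a separate
    inferred reward to every step.
    """
    probs = softmax(logits)                              # (T, n_actions)
    logp = np.log(probs[np.arange(len(actions)), actions])
    loss = 0.0
    for (s, e) in seg_bounds:                            # segment = steps s..e-1
        adv = values[e] - values[s]                      # segment-level advantage
        w = np.exp(beta * adv)                           # one weight per segment
        loss -= w * logp[s:e].sum()                      # weighted log-likelihood
    return loss
```

Because the weight is computed once per segment, preference information attached to the whole segment is preserved, rather than being redistributed into potentially inconsistent per-step rewards.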
Primary Area: reinforcement learning
Submission Number: 11060