SEER: Towards Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

23 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: preference-based reinforcement learning, human-in-the-loop reinforcement learning, deep reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A novel method for effective querying and human preference alignment in preference-based reinforcement learning
Abstract: One of the challenges in reinforcement learning lies in the meticulous design of a reward function that quantifies the quality of each decision as a scalar value. Preference-based reinforcement learning (PbRL) provides an alternative approach, avoiding reward engineering by learning rewards from human preferences among trajectories. PbRL involves sampling informative trajectories, learning rewards from preferences, optimizing the policy with the learned rewards, and subsequently generating higher-quality trajectories for the next iteration, thereby creating a virtuous circle. Two distinct problems arise: effective reward learning and aligning the policy with human preferences, both of which are essential for efficient learning. Motivated by these considerations, we propose an efficient preference-based RL method, dubbed SEER. We leverage state-action pairs that are well-supported in the current replay memory to bootstrap an empirical Q function ($\widehat{Q}$) that is aligned with human preference. The empirical Q function helps SEER sample more informative pairs for effective querying and regularizes the neural Q function ($Q_\theta$), thereby leading to a policy that is more consistent with human intent. Theoretically, we show that the empirical Q function is a lower bound of the oracle Q function under human preference. Our experimental results over several tasks demonstrate that the empirical Q function helps preference-based RL learn a more aligned Q function, outperforming state-of-the-art methods by a large margin.
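To make the core idea in the abstract concrete, below is a minimal sketch of how an empirical Q estimate bootstrapped from well-supported replay transitions could be used as a lower-bound regularizer on the learned Q network. This is an illustration under stated assumptions, not the paper's exact algorithm: the function names (`empirical_q`, `q_loss_with_lower_bound`), the hinge-style penalty, the discount value, and the discrete-action setup are all illustrative assumptions.

```python
# Hypothetical sketch of the lower-bound regularization idea (not the authors' code).
import torch
import torch.nn.functional as F

def empirical_q(rewards, dones, next_q_hat, gamma=0.99):
    """One-step bootstrapped empirical Q estimate for replayed transitions.

    `rewards` would come from the learned preference-based reward model;
    `next_q_hat` is the empirical estimate at the next well-supported
    state-action pair (assumed to be stored alongside the transition).
    """
    return rewards + gamma * (1.0 - dones) * next_q_hat

def q_loss_with_lower_bound(q_net, q_target, batch, gamma=0.99, lambda_reg=1.0):
    """Standard TD loss plus a penalty whenever Q_theta falls below the
    empirical lower-bound estimate \\hat{Q} (hinge-style regularizer)."""
    s, a, r, s_next, done, q_hat = batch  # q_hat: empirical estimate per pair
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        td_target = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_sa, td_target)
    # Penalize only where the network underestimates the empirical lower bound.
    bound_penalty = F.relu(q_hat - q_sa).pow(2).mean()
    return td_loss + lambda_reg * bound_penalty
```

A similar empirical estimate could in principle rank candidate query pairs by how much they disagree with $Q_\theta$, which is one plausible reading of "more informative pairs for effective querying"; the paper itself should be consulted for the actual selection criterion.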
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7946