Efficient Offline Preference-Based Reinforcement Learning with Transition-Dependent Discounting

17 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: preference-based reinforcement learning, offline reinforcement learning, RLHF, transition-dependent discounting
TL;DR: We propose TED, a simple and effective method that achieves query-efficient offline preference-based RL.
Abstract: Offline preference-based reinforcement learning (OPBRL) tackles two major limitations of traditional reinforcement learning: the need for online interaction and the requirement for carefully designed reward labels. Despite recent progress, solving complex tasks with a small number of preference labels remains challenging, as the learned reward function is inaccurate when preference labels are scarce. To tackle this challenge, we first demonstrate that the inaccurate reward model predicts low-preference regions much more precisely than high-preference regions, since the former suffer less from generalization error. Combining this insight with the pessimism principle of offline RL, we propose a novel OPBRL framework, Transition-dEpendent Discounting (TED), that excels in complex OPBRL tasks with only a small number of preference queries. TED assigns low transition-dependent discount factors to predicted low-preference regions, discouraging the offline agent from visiting these regions and thereby achieving higher performance. On the challenging Meta-World MT1 tasks, TED significantly outperforms current OPBRL baselines.
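As a rough illustration of the mechanism the abstract describes, the sketch below lowers the discount factor on transitions that a learned reward model predicts to be low-preference, so that TD backups propagate less value through those regions. This is only a minimal Python sketch under assumed interfaces; the names, thresholding rule, and hyperparameters (`reward_model`, `gamma_hi`, `gamma_lo`, `tau`) are illustrative assumptions and do not come from the paper.

```python
# Minimal sketch of transition-dependent discounting as described in the
# abstract. Everything here (names, thresholding, hyperparameters) is an
# assumption for illustration, not the authors' implementation.
import torch


def transition_discount(reward_model, s, a, gamma_hi=0.99, gamma_lo=0.5, tau=0.0):
    """Assign a low discount to transitions the learned reward model
    predicts to be low-preference, so the offline agent is discouraged
    from bootstrapping value through those regions."""
    with torch.no_grad():
        r_hat = reward_model(s, a)          # predicted preference score
    low_pref = (r_hat < tau).float()        # 1 where predicted low-preference
    return gamma_lo * low_pref + gamma_hi * (1.0 - low_pref)


def td_target(reward_model, critic_target, s, a, s_next, a_next):
    """Standard TD(0) target, except the discount varies per transition."""
    gamma = transition_discount(reward_model, s, a)
    with torch.no_grad():
        r_hat = reward_model(s, a)
        q_next = critic_target(s_next, a_next)
    return r_hat + gamma * q_next


if __name__ == "__main__":
    # Toy check with stand-in models: a bilinear "reward model" and a
    # constant critic, on a random batch of transitions.
    reward_model = lambda s, a: (s * a).sum(-1)
    critic = lambda s, a: torch.ones(s.shape[0])
    s, a = torch.randn(4, 3), torch.randn(4, 3)
    print(td_target(reward_model, critic, s, a, s, a))
```

A hard threshold is just one way to realize the idea; a smooth mapping from predicted preference to discount would serve the same purpose of penalizing bootstrapping through predicted low-preference regions.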
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 793