Query-Efficient Offline Preference-Based Reinforcement Learning via In-Dataset Exploration

17 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: preference-based RL, offline RL, reinforcement learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a principled approach to efficient query selection in offline preference-based RL.
Abstract: Preference-based reinforcement learning (PbRL) has shown great promise in a variety of applications, since it avoids explicit reward annotation and aligns better with human intentions. However, obtaining preference feedback can still be expensive or time-consuming, which remains a major barrier for preference-based RL. In this paper, we propose a novel approach to improve the query efficiency of offline preference-based RL by introducing the concept of in-dataset exploration. In-dataset exploration consists of two key components: weighted trajectory queries and a principled pairwise exploration strategy that balances pessimism over transitions with optimism over reward functions. We show that this strategy yields a provably efficient algorithm that judiciously selects queries to minimize the total number of queries while ensuring robust performance. We further design an empirical version of our algorithm that adapts the theoretical insights to practical settings. Experiments on various tasks demonstrate that our approach achieves strong performance with significantly fewer queries than state-of-the-art methods.
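To make the abstract's idea concrete, here is a minimal, hypothetical sketch (not the authors' algorithm) of query selection that pairs optimism over reward functions with pessimism over transitions: candidates are restricted to segments already present in the offline dataset (the "in-dataset" constraint standing in for pessimism over transitions), and the queried pair is the one on which an ensemble of learned reward models disagrees most (standing in for optimism over rewards). All names and the linear-reward assumption are illustrative.

```python
# Illustrative sketch only: query selection balancing ensemble disagreement
# (optimism over rewards) with an in-dataset candidate set (pessimism over
# transitions). Names such as select_query_pair are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Suppose the offline dataset yields N candidate trajectory segments,
# each summarized by a feature vector phi(tau) of dimension d.
N, d, ensemble_size = 50, 8, 5
segment_features = rng.normal(size=(N, d))            # phi(tau_i), i = 1..N
reward_weights = rng.normal(size=(ensemble_size, d))  # ensemble of linear reward models


def ensemble_returns(features, weights):
    """Estimated return of each segment under each reward model.

    Returns an array of shape (ensemble_size, N)."""
    return weights @ features.T


def select_query_pair(features, weights):
    """Pick the pair of in-dataset segments whose relative preference the
    ensemble is most uncertain about (largest spread in predicted return
    difference across ensemble members)."""
    returns = ensemble_returns(features, weights)      # (E, N)
    n = features.shape[0]
    best_pair, best_spread = (0, 1), -np.inf
    for i in range(n):
        for j in range(i + 1, n):
            diff = returns[:, i] - returns[:, j]       # preference margin per model
            spread = diff.max() - diff.min()           # ensemble disagreement
            if spread > best_spread:
                best_spread, best_pair = spread, (i, j)
    return best_pair, best_spread


pair, spread = select_query_pair(segment_features, reward_weights)
print(f"Query segments {pair} (ensemble disagreement {spread:.3f})")
```

In a full loop, the selected pair would be sent to the annotator, the preference label used to update the reward ensemble, and the process repeated until the query budget is exhausted; the paper's actual weighted trajectory queries and theoretical guarantees go beyond this toy selection rule.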
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 863