RIME: Robust Preference-based Reinforcement Learning with Noisy Human Preferences

20 Sept 2023 (modified: 06 Feb 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: preference-based reinforcement learning, human-in-the-loop reinforcement learning, deep reinforcement learning
TL;DR: We present a robust preference-based RL method for effective reward learning from noisy human preferences via a denoising discriminator and a warm-started reward model.
Abstract: Designing an effective reward function remains a significant challenge in numerous reinforcement learning (RL) applications. Preference-based Reinforcement Learning (PbRL) presents a framework that circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL algorithms primarily focus on feedback efficiency, which heavily depends on high-quality feedback from domain experts. This over-reliance results in a lack of robustness, leading to severe performance degradation under noisy feedback conditions and thereby limiting the broad applicability of PbRL. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy human preferences. Our method incorporates a sample selection-based discriminator that dynamically selects denoised preferences for robust training. To mitigate the accumulated error caused by incorrect selection, we propose to warm start the reward model for a good initialization, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the current state-of-the-art PbRL method. Ablation studies further demonstrate that the warm start is crucial for both robustness and feedback efficiency in limited-feedback cases.
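The abstract gives no implementation details, so the following is only a minimal sketch of sample-selection-based preference filtering in the spirit described: a Bradley-Terry reward model whose update keeps only preference pairs with low per-sample loss, treating high-loss pairs as suspected noisy labels. The threshold tau, network sizes, and loss form are assumptions made for illustration, not RIME's actual procedure.

```python
# Illustrative sketch (not RIME's implementation): reward learning from
# pairwise preferences with a loss-threshold filter for suspected noisy labels.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Per-step reward, summed over the segment to score a trajectory.
        return self.net(torch.cat([obs, act], dim=-1)).sum(dim=1).squeeze(-1)


def robust_preference_update(model, optimizer, seg0, seg1, labels, tau=2.0):
    """One gradient step that keeps only low-loss (trusted) preference pairs.

    seg0, seg1: tuples (obs, act) with shapes (batch, horizon, dim).
    labels: LongTensor of 0/1, 1 meaning segment 1 is preferred.
    tau: assumed loss threshold for discarding suspected noisy pairs.
    """
    ret0 = model(*seg0)                       # estimated return of segment 0
    ret1 = model(*seg1)                       # estimated return of segment 1
    logits = torch.stack([ret0, ret1], dim=-1)
    per_sample = nn.functional.cross_entropy(logits, labels, reduction="none")

    # Sample selection: keep pairs whose Bradley-Terry loss stays below tau,
    # on the heuristic that clean labels remain low-loss for a warm-started model.
    keep = per_sample.detach() < tau
    if keep.any():
        loss = per_sample[keep].mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return keep.float().mean().item()         # fraction of pairs kept this step
```

In this sketch, warm-starting would amount to initializing the reward model (e.g., from pre-training on intrinsic rewards) before the first filtered update, so that the loss-based selection above is reliable from the start.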
Supplementary Material: zip
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2667