Mixing Corrupted Preferences for Robust and Feedback-Efficient Preference-Based Reinforcement Learning
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: preference-based reinforcement learning, label noise, robotic manipulation, locomotion
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A preference-based reinforcement learning method that addresses performance degradation caused by corrupted preferences
Abstract: Preference-based reinforcement learning (RL) trains agents using non-expert feedback without the need for detailed reward design. In this approach, a human teacher provides feedback to the agent by comparing two behavior trajectories and labeling the preference. Although recent studies have improved feedback efficiency through methods such as unsupervised exploration for collecting diverse trajectories and self- or semi-supervised learning on unlabeled queries, they often assume flawless human annotation. In practice, human teachers may make mistakes or hold conflicting opinions about trajectory preferences. The potential negative impact of such corrupted preferences on capturing user intent remains an underexplored challenge. To address this challenge, we introduce mixing corrupted preferences (MCP) for robust and feedback-efficient preference-based RL. Mixup has shown robustness against corrupted labels by reducing the influence of faulty instances. By generating new preference data through the component-wise mixing of two labeled preferences, our method lessens the impact of corrupted feedback, thereby enhancing robustness. Furthermore, MCP improves feedback efficiency: even with limited labeled feedback, it can generate unlimited new data. We evaluate our method on three locomotion and six robotic manipulation tasks from the B-Pref benchmark, comparing it with PEBBLE under both perfectly rational and imperfect teachers. Our results show that MCP significantly outperforms PEBBLE, requiring fewer feedback instances and a shorter training period, highlighting its superior feedback efficiency.
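The abstract describes MCP as component-wise mixing of two labeled preference instances. The snippet below is a minimal sketch of that idea under assumptions not stated in the submission: segments are fixed-length arrays of state-action features, preference labels lie in [0, 1] (1 meaning the second segment is preferred), and the mixing weight is drawn from a Beta distribution as in standard mixup. Function and variable names are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of MCP-style preference mixing (not the authors' code).
import numpy as np

def mix_preferences(seg0_a, seg1_a, y_a, seg0_b, seg1_b, y_b, alpha=0.5, rng=None):
    """Mix two labeled preference triples (seg0, seg1, y) with a Beta-drawn weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)               # mixup coefficient in (0, 1)
    seg0 = lam * seg0_a + (1 - lam) * seg0_b   # mix the first segments component-wise
    seg1 = lam * seg1_a + (1 - lam) * seg1_b   # mix the second segments component-wise
    y = lam * y_a + (1 - lam) * y_b            # soft preference label
    return seg0, seg1, y

# Usage example: two labeled pairs of 50-step segments with 10-dim features.
rng = np.random.default_rng(0)
sa, ta = rng.normal(size=(50, 10)), rng.normal(size=(50, 10))
sb, tb = rng.normal(size=(50, 10)), rng.normal(size=(50, 10))
mixed0, mixed1, y_mix = mix_preferences(sa, ta, 1.0, sb, tb, 0.0, alpha=0.5, rng=rng)
```

Because the mixed label is a convex combination of the two original labels, a single corrupted preference only partially influences the resulting training target, which is the intuition behind mixup's robustness to label noise that the abstract invokes.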
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1107