Outward Odyssey: Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning
Keywords: Preference-based Reinforcement Learning; Reinforcement Learning; Human Feedback
TL;DR: To enhance the reliability of the reward model for current policy improvement, we develop the Proximal Policy Exploration (PPE) algorithm, which increases the preference buffer's coverage of near-policy regions.
Abstract: Reinforcement learning (RL) heavily depends on well-designed reward functions, which can be challenging to create and may introduce biases, especially for complex behaviors. Preference-based RL (PbRL) addresses this by using human feedback to construct a reward model that reflects human preferences, yet it still requires considerable human involvement. To alleviate this, several PbRL methods aim to select queries that need minimal feedback. However, these methods do not directly enhance the data coverage within the preference buffer. In this paper, to highlight the critical role of preference buffer coverage in determining the quality of the reward model, we first investigate and find that a reward model's evaluative accuracy is highest for trajectories within the preference buffer's distribution and decreases significantly for out-of-distribution trajectories. Motivated by this phenomenon, we introduce the **Proximal Policy Exploration (PPE)** algorithm, which consists of a *proximal-policy extension* method and a *mixture distribution query* method.
To achieve higher preference buffer coverage, the *proximal-policy extension* method encourages active exploration of near-policy regions that fall outside the preference buffer's distribution. To balance in-distribution and out-of-distribution data, the *mixture distribution query* method proactively selects a mix of data from both inside and outside the preference buffer's distribution for querying. PPE not only expands the preference buffer's coverage but also preserves the reward model's evaluative capability on in-distribution data. Our comprehensive experiments demonstrate that PPE achieves significant improvements in both human feedback efficiency and RL sample efficiency, underscoring the importance of preference buffer coverage in PbRL tasks.
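
To make the mixture-distribution query idea concrete, below is a minimal, hypothetical Python sketch: it scores candidate trajectory segments by their distance to the preference buffer and splits the query budget between in-distribution (close) and out-of-distribution (far) candidates. The feature representation, the Euclidean nearest-neighbour distance, and the `in_dist_ratio` parameter are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def mixture_distribution_query(candidate_segments, buffer_segments,
                               n_queries, in_dist_ratio=0.5):
    """Illustrative sketch of a mixture-distribution query rule (assumed, not the paper's exact method).

    candidate_segments: (N, d) array of candidate segment features.
    buffer_segments:    (M, d) array of segment features already in the preference buffer.
    Returns indices of candidates to query, mixing in- and out-of-distribution segments.
    """
    # Distance of each candidate to its nearest neighbour in the preference buffer.
    dists = np.array([
        np.min(np.linalg.norm(buffer_segments - seg, axis=1))
        for seg in candidate_segments
    ])

    order = np.argsort(dists)            # closest (in-distribution) candidates first
    n_in = int(n_queries * in_dist_ratio)
    n_out = n_queries - n_in

    in_dist_idx = order[:n_in]           # segments the current buffer already covers well
    out_dist_idx = order[-n_out:]        # segments outside the buffer's coverage
    return np.concatenate([in_dist_idx, out_dist_idx])
```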
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13478