Optimizing Reward Models with Proximal Policy Exploration in Preference-Based Reinforcement Learning
Keywords: Preference-based Reinforcement Learning; Reinforcement Learning; Human Feedback
TL;DR: To enhance the effectiveness of the reward model in near-policy regions, we develop the Proximal Policy Exploration (PPE) algorithm, which increases the coverage of the preference buffer near the current policy's distribution.
Abstract: Traditional reinforcement learning (RL) relies on carefully designed reward functions, which are challenging to implement for complex behaviors and may introduce biases in real-world applications. Preference-based RL (PbRL) offers a promising alternative by using human feedback, yet its extensive demand for human input constrains scalability. To address this, this paper proposes a proximal policy exploration algorithm (**PPE**), designed to improve the efficiency of human feedback by concentrating queries on near-policy regions. By incorporating a policy-aligned query mechanism, our approach not only increases the accuracy of the reward model but also reduces the need for extensive human interaction. Our results demonstrate that improving the reward model's evaluative precision in near-policy regions makes policy optimization more reliable, ultimately boosting overall performance. Furthermore, our comprehensive experiments show that actively encouraging diversity in feedback substantially improves human feedback efficiency.
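The abstract describes a policy-aligned query mechanism that focuses preference queries on near-policy regions. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: candidate trajectory segments are scored by how likely their actions are under the current policy (here assumed to be a diagonal Gaussian), and the highest-scoring segments are paired up for human preference labeling. The `policy` interface, segment format, and helper names are assumptions for illustration only.

```python
import numpy as np


def policy_log_prob(policy_mean, policy_std, actions):
    """Log-likelihood of actions under an assumed diagonal-Gaussian policy."""
    var = policy_std ** 2
    return -0.5 * np.sum((actions - policy_mean) ** 2 / var
                         + np.log(2 * np.pi * var), axis=-1)


def near_policy_scores(segments, policy):
    """Score each (states, actions) segment by the mean log-probability of its
    actions under the current policy; higher means closer to policy behavior."""
    scores = []
    for states, actions in segments:
        mean, std = policy(states)  # assumed: policy returns Gaussian parameters
        scores.append(policy_log_prob(mean, std, actions).mean())
    return np.array(scores)


def select_query_pairs(segments, policy, num_queries, seed=0):
    """Pick segment pairs with the highest near-policy scores to send to the
    human labeler (segments are paired uniformly among the top candidates)."""
    scores = near_policy_scores(segments, policy)
    top = np.argsort(scores)[::-1][:2 * num_queries]  # most near-policy segments
    rng = np.random.default_rng(seed)
    rng.shuffle(top)
    return [(top[2 * i], top[2 * i + 1]) for i in range(num_queries)]
```

Under this reading, labeled pairs concentrated near the current policy's state-action distribution should make the learned reward model more accurate exactly where policy optimization queries it, which is the effect the abstract attributes to PPE.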
Submission Number: 29