Proximal policy optimization with reward-based prioritization

Published: 01 Jan 2025, Last Modified: 24 May 2025, Expert Syst. Appl. 2025, CC BY-SA 4.0
Abstract: Proximal Policy Optimization (PPO) is a policy-optimization-based deep reinforcement learning algorithm that has achieved outstanding results and is widely applied. Despite its popularity, PPO has several notable drawbacks, including sensitivity to hyperparameters, slow convergence, and limited exploration. Recent research has proposed code-level optimizations that improve training effectiveness and achieve good performance. In this paper, we propose Proximal Policy Optimization with Reward-based Prioritization (RP-PPO), an algorithm that assigns experiences different priorities for policy updates based on their rewards and selects the model that attains the highest average reward. We also apply several minor techniques, including Normalized Reward and Dual Learning Rate Decay, to further optimize the algorithm. Finally, we conduct a series of experiments in four problem domains in the Gym environment to demonstrate the superiority of our algorithm.
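The abstract does not specify how priorities are computed, so the following is only a minimal sketch of one plausible reading of reward-based prioritization: rewards are normalized, shifted to be positive, and used as sampling weights when drawing a minibatch for the policy update. The function names, the exponent `alpha`, and the shift-by-minimum scheme are all assumptions made for illustration, not the paper's method. Dual Learning Rate Decay is not covered here.

```python
import random
import statistics


def normalize_rewards(rewards, eps=1e-8):
    """Zero-mean, unit-variance rewards (one common form of reward normalization)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def reward_priorities(rewards, alpha=1.0, eps=1e-8):
    """Map rewards to sampling probabilities (hypothetical scheme).

    Higher reward gives higher probability. alpha controls how sharply
    the distribution favors high-reward experiences; alpha = 0 is uniform.
    """
    z = normalize_rewards(rewards, eps)
    low = min(z)
    # Shift so every weight is strictly positive before exponentiation.
    weights = [(v - low + eps) ** alpha for v in z]
    total = sum(weights)
    return [w / total for w in weights]


def sample_minibatch(rewards, batch_size, rng, alpha=1.0):
    """Draw experience indices with probability proportional to priority."""
    probs = reward_priorities(rewards, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    rewards = [0.0, 1.0, 2.0, 10.0]
    print(reward_priorities(rewards))
    print(sample_minibatch(rewards, batch_size=8, rng=rng))
```

In a full PPO loop, the sampled indices would select which stored transitions enter each clipped-objective update. Sampling with replacement keeps the sketch simple, at the cost of possibly repeating high-reward experiences within a batch.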