Keywords: Deep reinforcement learning, on-policy, proximal policy optimization
TL;DR: We enhance PPO by enforcing proper action bounds, training the critic with off-policy data, and adding a maximum-entropy bonus, while simplifying the implementation.
Abstract: On-policy Reinforcement Learning (RL) offers desirable features such as stable learning, fewer policy updates, and the ability to evaluate a policy’s return during training. While recent work has focused on off-policy methods and achieved significant advances, PPO remains the go-to algorithm for on-policy RL due to its apparent simplicity and effectiveness. Nonetheless, PPO is highly sensitive to hyperparameters and relies on subtle, often poorly documented adjustments that can critically affect its performance, thereby limiting its utility in complex scenarios. In this paper, we revisit the PPO algorithm and introduce principled enhancements that improve performance while eliminating the need for extensive hyperparameter tuning and implementation-specific optimizations. Our proposed approach, PPO+, adapts PPO so that it adheres more closely to the on-policy objective, enhancing stability and efficiency.
PPO+ sets a new state-of-the-art for deep on-policy RL on MuJoCo control problems while maintaining a straightforward implementation.
PPO+ demonstrates significantly improved asymptotic performance over PPO and substantially narrows the gap with off-policy algorithms on several challenging continuous control tasks.
Beyond performance gains, our findings offer a fresh perspective on on-policy RL.
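For concreteness, the following is a minimal sketch, in PyTorch, of two of the ingredients named in the TL;DR: a tanh-squashed Gaussian policy to enforce proper action bounds, and a PPO clipped surrogate augmented with a maximum-entropy bonus. The names `TanhGaussianPolicy` and `ppo_entropy_loss` and all hyperparameter values are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): bounded actions via a
# tanh-squashed Gaussian policy, and a PPO clipped loss with an entropy bonus.
import torch
import torch.nn as nn
from torch.distributions import Normal


class TanhGaussianPolicy(nn.Module):
    """Gaussian policy squashed by tanh so actions stay in [-1, 1]."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> Normal:
        return Normal(self.net(obs), self.log_std.exp())

    def log_prob(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Invert the tanh squashing, then apply the change-of-variables correction.
        pre_tanh = torch.atanh(action.clamp(-0.999, 0.999))
        logp = self.dist(obs).log_prob(pre_tanh)
        logp = logp - torch.log(1.0 - action.pow(2) + 1e-6)
        return logp.sum(-1)


def ppo_entropy_loss(policy, obs, actions, old_log_probs, advantages,
                     clip_eps: float = 0.2, ent_coef: float = 0.01):
    """PPO clipped surrogate plus a maximum-entropy bonus (coefficients assumed)."""
    log_probs = policy.log_prob(obs, actions)
    ratio = (log_probs - old_log_probs).exp()
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages)
    # Entropy of the pre-squashing Gaussian, summed over action dimensions.
    entropy = policy.dist(obs).entropy().sum(-1)
    return -(surrogate + ent_coef * entropy).mean()
```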
Confirmation: I understand that authors of each paper submitted to EWRL may be asked to review 2-3 other submissions to EWRL.
Serve As Reviewer: ~Mahdi_Kallel1
Track: Regular Track: unpublished work
Submission Number: 104