Keywords: Deep reinforcement learning, on-policy, policy gradients
TL;DR: We enhance PPO by using discounted policy gradients, training the critic with off-policy data, and adding a maximum-entropy bonus, all while simplifying the implementation.
Abstract: On-policy Reinforcement Learning (RL) offers desirable features such as stable learning, fewer policy updates, and the ability to evaluate a policy’s return during training. While recent efforts have focused on off-policy methods and achieved significant advances, Proximal Policy Optimization (PPO) remains the go-to algorithm for on-policy RL due to its perceived simplicity and effectiveness. In practice, however, PPO is highly sensitive to hyperparameters and depends on subtle, poorly documented tweaks that can make or break its success, hindering its applicability in complex problems. In this paper, we revisit on-policy deep RL with a focus on improving PPO by introducing principled solutions that enhance its performance while eliminating the need for extensive hyperparameter tuning and implementation-level optimizations. Our effort leads to PPO+, a methodical adaptation of the PPO algorithm that adheres more closely to its theoretical foundations.
PPO+ sets a new state-of-the-art for on-policy RL on MuJoCo control problems while maintaining a straightforward, trick-free implementation. Beyond raw performance, our findings offer a fresh perspective on on-policy RL that could reignite interest in these approaches.
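To make the TL;DR's modifications concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of a PPO-style clipped surrogate loss with an explicit gamma^t discount on the policy gradient and an added entropy bonus; the off-policy critic training is only noted in comments. All names and hyperparameter values (ppo_plus_policy_loss, gamma, ent_coef, clip_eps) are assumptions for illustration.

```python
# Hypothetical sketch of the policy-side loss described in the TL;DR:
# (1) discounted policy gradient via a gamma**t weight per transition,
# (2) maximum-entropy bonus added to the surrogate objective.
# (3) The critic would be trained separately, possibly on off-policy data
#     from a replay buffer (not shown here).
import torch

def ppo_plus_policy_loss(log_probs, old_log_probs, advantages, entropies,
                         timesteps, gamma=0.99, clip_eps=0.2, ent_coef=0.01):
    """Clipped surrogate loss with gamma**t discounting and an entropy bonus."""
    ratio = torch.exp(log_probs - old_log_probs)          # importance ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    discount = gamma ** timesteps                          # discounted policy gradient
    return -(discount * (surrogate + ent_coef * entropies)).mean()

# Toy usage with random tensors standing in for a rollout batch.
if __name__ == "__main__":
    n = 128
    loss = ppo_plus_policy_loss(
        log_probs=torch.randn(n), old_log_probs=torch.randn(n),
        advantages=torch.randn(n), entropies=torch.rand(n),
        timesteps=torch.randint(0, 1000, (n,)).float(),
    )
    print(loss.item())
```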
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3742