Keywords: Reinforcement learning, on-policy, maximum entropy RL
TL;DR: An on-policy adaptation of Soft Actor-Critic that brings deep on-policy RL closer to its theoretical foundations.
Abstract: On-policy Reinforcement Learning (RL) offers several desirable properties, including more stable learning, less frequent policy changes, and the ability to evaluate a policy's return during training. Despite the considerable success of recent off-policy methods, their on-policy counterparts continue to lag in asymptotic performance and sample efficiency. Proximal Policy Optimization (PPO) remains the de facto standard, despite its complexity and demonstrated sensitivity to hyperparameters. In this work, we introduce On-Policy Soft Actor-Critic (ON-SAC), a methodical adaptation of the Soft Actor-Critic (SAC) algorithm to the on-policy setting. Our approach starts from the observation that current on-policy algorithms do not use true on-policy gradients, and we build on this observation to offer well-founded remedies. Our algorithm establishes a new state of the art for deep on-policy RL, while simplifying training by eliminating the need for trust-region methods and intricate critic learning schemes.
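To make the general idea concrete, the sketch below shows a minimal entropy-regularized, on-policy actor-critic update in PyTorch. It is not the paper's ON-SAC algorithm: the network sizes, the fixed temperature alpha, the one-step soft value target, and the placeholder rollout data are all assumptions made purely for illustration, standing in for SAC-style maximum-entropy losses computed on freshly collected on-policy data rather than a replay buffer.

```python
# Illustrative sketch only (not the paper's ON-SAC): an entropy-regularized,
# on-policy actor-critic update, assuming a Gaussian policy and a state-value
# critic trained on a freshly collected rollout from the current policy.
import torch
import torch.nn as nn

obs_dim, act_dim, alpha, gamma = 4, 2, 0.2, 0.99  # assumed toy dimensions and temperature

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 2 * act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(policy.parameters()) + list(critic.parameters()), lr=3e-4)

def dist(obs):
    # Gaussian policy: the network outputs a mean and a log-std per action dimension.
    mean, log_std = policy(obs).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

# Placeholder on-policy rollout (random data standing in for environment interaction).
obs = torch.randn(128, obs_dim)
acts = dist(obs).sample()
rews = torch.randn(128)
next_obs = torch.randn(128, obs_dim)
dones = torch.zeros(128)

d = dist(obs)
logp = d.log_prob(acts).sum(-1)
entropy = d.entropy().sum(-1)

with torch.no_grad():
    # Soft (entropy-augmented) one-step target, in the maximum-entropy RL spirit.
    target = rews + alpha * entropy + gamma * (1 - dones) * critic(next_obs).squeeze(-1)

value = critic(obs).squeeze(-1)
advantage = (target - value).detach()

# On-policy gradient: log-prob weighted advantage plus an entropy bonus,
# computed on data just collected by the current policy (no replay buffer,
# no trust region).
policy_loss = -(logp * advantage + alpha * entropy).mean()
value_loss = (value - target).pow(2).mean()

opt.zero_grad()
(policy_loss + value_loss).backward()
opt.step()
```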
Submission Number: 118