ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

18 Sept 2023 (modified: 11 Feb 2024). Submitted to ICLR 2024.
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Reinforcement Learning, On-Policy, Deep Learning, Actor-Critic, Exploration, Generalization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Towards approximate Bayesian inference in on-policy actor-critic deep reinforcement learning without compromising efficacy.
Abstract: This paper introduces a novel method for enhancing the effectiveness of the Asynchronous Advantage Actor-Critic (A3C) algorithm by incorporating state-aware exploration. We achieve this improvement through three simple yet impactful modifications: (1) applying a ReLU function to advantage estimates, (2) using spectral normalization, and (3) incorporating dropout. We prove, under standard assumptions, that restricting policy updates to positive advantages optimizes a lower bound on the value function plus a constant. Further, we show that this constant is bounded in proportion to the Lipschitz constant of the value function, which offers theoretical grounding for the use of spectral normalization. Our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables prudent exploration around the modes of the actor via Thompson sampling. Extensive empirical evaluations on diverse benchmarks reveal the superior performance of our approach compared to existing on-policy algorithms. Notably, we achieve significant improvements over Proximal Policy Optimization (PPO) on both the challenging ProcGen generalization benchmark and the MuJoCo benchmark for continuous control.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1495
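
The first modification named in the abstract, applying a ReLU to advantage estimates, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`relu_advantages`, `policy_loss`), the use of a plain return-minus-value advantage, and the mean reduction are all assumptions made for illustration. Spectral normalization and dropout are omitted.

```python
def relu_advantages(returns, values):
    """Clip each advantage estimate at zero: max(0, return - value).

    Assumption: the advantage is a simple return minus a critic value;
    the paper may use a different estimator.
    """
    return [max(0.0, g - v) for g, v in zip(returns, values)]


def policy_loss(log_probs, returns, values):
    """Hypothetical policy-gradient loss weighted by ReLU'd advantages.

    Samples with a negative advantage get weight zero, so only
    positive-advantage samples contribute to the update.
    """
    advantages = relu_advantages(returns, values)
    weighted = [lp * a for lp, a in zip(log_probs, advantages)]
    return -sum(weighted) / len(weighted)


if __name__ == "__main__":
    returns = [1.0, 0.0, 2.0]
    values = [0.5, 1.0, 2.0]
    log_probs = [-1.0, -2.0, -3.0]
    print(relu_advantages(returns, values))
    print(policy_loss(log_probs, returns, values))
```

In this sketch, the second sample has a negative advantage and the third has a zero advantage, so neither influences the loss; only the first sample does.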