Convergence Proof for Actor-Critic Methods Applied to PPO and RUDDER

Markus Holzleitner; Lukas Gruber; Jose Arjona-Medina; Johannes Brandstetter; Sepp Hochreiter

28 Sept 2020 (modified: 05 May 2023), ICLR 2021 Conference Withdrawn Submission, Readers: Everyone

Keywords: reinforcement learning, actor critic algorithms, policy gradient methods, stochastic approximation, PPO, RUDDER

Abstract: We prove, under commonly used assumptions, the convergence of actor-critic reinforcement learning algorithms, which simultaneously learn a policy function (the actor) and a value function (the critic). Both functions can be deep neural networks of arbitrary complexity. Our framework allows us to show convergence of the well-known Proximal Policy Optimization (PPO) and of the recently introduced RUDDER. For the convergence proof we employ recently introduced techniques from two time-scale stochastic approximation theory. Our results are valid for actor-critic methods that use episodic samples and whose policy becomes more greedy during learning. Previous convergence proofs assume linear function approximation, cannot treat episodic samples, or do not consider that policies become greedy. The latter is relevant since optimal policies are typically deterministic.
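
For orientation, a generic two time-scale stochastic approximation scheme can be sketched as below. This is the standard textbook form, not necessarily the exact setting analyzed in the paper. The symbols are assumptions introduced here for illustration: \theta_n for the slowly updated actor parameters, \omega_n for the quickly updated critic parameters, h and g for the mean update directions, M^{(1)}, M^{(2)} for noise terms, and a(n), b(n) for the step sizes.

```latex
\begin{align}
  \theta_{n+1} &= \theta_n + a(n)\,\bigl[\, h(\theta_n, \omega_n) + M^{(1)}_{n+1} \,\bigr], \\
  \omega_{n+1} &= \omega_n + b(n)\,\bigl[\, g(\theta_n, \omega_n) + M^{(2)}_{n+1} \,\bigr],
\end{align}
% Typical step-size conditions (assumed here, as in the classical theory):
\begin{equation}
  \sum_n a(n) = \sum_n b(n) = \infty, \qquad
  \sum_n \bigl( a(n)^2 + b(n)^2 \bigr) < \infty, \qquad
  \lim_{n \to \infty} \frac{a(n)}{b(n)} = 0 .
\end{equation}
```

Because a(n)/b(n) tends to zero, the critic iterate \omega_n sees the actor iterate \theta_n as nearly fixed, while the actor sees the critic as nearly converged; this separation is what lets the two updates be analyzed one after the other.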

One-sentence Summary: We show local convergence of an abstract actor-critic setting and apply it to a version of PPO and RUDDER under practical assumptions.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Reviewed Version (pdf): https://openreview.net/references/pdf?id=IwcF0CjghA
