Incremental Policy Gradients for Online Reinforcement Learning Control

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Withdrawn Submission · Readers: Everyone
Keywords: reinforcement learning, policy gradient, incremental, online, eligibility traces
Abstract: Policy gradient methods are built on the policy gradient theorem, which involves a term representing the full sum of future rewards: the return. Because of this, one usually either waits until the end of an episode before performing updates, or learns an estimate of this return, a so-called critic. In this work, our emphasis is on the first approach: we detail an incremental policy gradient update which neither waits until the end of the episode nor relies on learning estimates of the return. We provide on-policy and off-policy variants of our algorithm, for both the discounted return and average reward settings. Theoretically, we draw a connection between the traces our methods use and the stationary distributions of the discounted and average reward settings. We conclude with an experimental evaluation of our methods on both simple-to-understand and complex domains.
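To make the general idea concrete, below is a minimal sketch of a per-step (incremental) policy-gradient update driven by an eligibility trace of score functions. This is only an illustration of the principle described in the abstract (updating from each reward through an accumulated trace instead of waiting for the full return or learning a critic), not the paper's actual algorithm; the tabular softmax policy and the environment interface `step(s, a)` are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def incremental_pg_episode(theta, step, s0, n_actions, alpha=0.01, gamma=0.99):
    """Run one episode of a generic incremental policy-gradient update.

    A trace of score functions z_t = gamma * z_{t-1} + grad log pi(a_t | s_t)
    is maintained, and each observed reward updates the policy through it:
    theta += alpha * r * z. Summed over an episode (with fixed theta), these
    updates recover sum_t G_t * grad log pi(a_t | s_t) without ever forming
    the return G_t explicitly; here theta is updated online at every step.

    `step(s, a)` is a hypothetical environment function returning
    (next_state, reward, done); `theta` has shape (n_states, n_actions).
    """
    z = np.zeros_like(theta)          # eligibility trace over policy parameters
    s, done = s0, False
    while not done:
        probs = softmax(theta[s])                     # tabular softmax policy
        a = np.random.choice(n_actions, p=probs)
        # grad log pi(a|s) for a tabular softmax: one-hot(a) - probs, in row s
        grad_log_pi = np.zeros_like(theta)
        grad_log_pi[s] = -probs
        grad_log_pi[s, a] += 1.0
        z = gamma * z + grad_log_pi                   # accumulate the trace
        s, r, done = step(s, a)
        theta += alpha * r * z                        # per-step update, no critic
    return theta
```

The design choice the sketch highlights is that the trace replaces the return: each reward is credited to all past actions through the discounted accumulation in `z`, so updates can happen at every time step rather than at episode boundaries.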
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=BF5Ecz0Zgi