Average Reward Reinforcement Learning with Monotonic Policy Improvement

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: Reinforcement Learning, Deep Reinforcement Learning, Average Reward, Policy Optimization, Model Free RL
Abstract: In continuing control tasks, an agent's average reward per time step is a more natural performance measure than the commonly used discounting framework, since it better captures the agent's long-term behavior. We derive a novel lower bound on the difference of the long-term average reward of two policies. The lower bound depends on the average divergence between the policies and on the so-called Kemeny constant, which measures how well-connected the unichain Markov chains associated with the policies are. We also show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) yields an uninformative lower bound in the average reward setting. Based on our lower bound, we develop an iterative procedure that produces a sequence of monotonically improving policies under the average reward criterion. When combined with Deep Reinforcement Learning (DRL) methods, the procedure leads to scalable and efficient algorithms for maximizing the agent's average reward performance. Empirically, we demonstrate the effectiveness of our method on continuing control tasks and show how discounting can lead to unsatisfactory performance.
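For reference, here is a brief sketch of the two quantities the abstract relies on. These are standard textbook definitions, not a statement of the paper's bound, and they assume a unichain MDP under policy $\pi$ with stationary state distribution $d_\pi$:
\[
\rho(\pi) \;=\; \lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}_\pi\!\left[\sum_{t=0}^{N-1} r(s_t,a_t)\right] \;=\; \sum_{s} d_\pi(s)\sum_{a}\pi(a\mid s)\,r(s,a),
\qquad
\kappa_\pi \;=\; \sum_{j} d_\pi(j)\, m_\pi(i,j),
\]
where $m_\pi(i,j)$ is the expected first-passage time from state $i$ to state $j$ under $\pi$ (with $m_\pi(i,i)=0$). A classical property of the Kemeny constant $\kappa_\pi$ is that the sum is the same for every starting state $i$; smaller values indicate a better-connected, faster-mixing chain.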
One-sentence Summary: Theoretically extend trust-region-based methods to the average reward setting and empirically demonstrate their efficacy on continuing control tasks.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Reviewed Version (pdf): https://openreview.net/references/pdf?id=ChHAhiBSMN