Keywords: Finite Sample Analysis, Average-Reward Reinforcement Learning, TD Learning with Linear function Approximation, Q-Learning
Abstract: The focus of this paper is on sample complexity guarantees of average-reward reinforcement learning algorithms, which are known to be more challenging to study than their discounted-reward counterparts. To the best of our knowledge, we provide the first known finite sample guarantees using both constant and diminishing step sizes of (i) average-reward TD($\lambda$) with linear function approximation for policy evaluation and (ii) average-reward $Q$-learning in the tabular setting to find the optimal policy. A major challenge is that since the value functions are agnostic to an additive constant, the corresponding Bellman operators are no longer contraction mappings under any norm. We obtain the results for TD($\lambda$) by working in an appropriately defined subspace that ensures uniqueness of the solution. For $Q$-learning, we exploit the span seminorm contractive property of the Bellman operator, and construct a novel Lyapunov function obtained by infimal convolution of a generalized Moreau envelope and the indicator function of a set.
Supplementary Material: pdf
Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.