Keywords: Average-reward Q-learning, adaptive stepsizes, non-asymptotic analysis, non-Markovian stochastic approximation.
TL;DR: This work presents the first finite-time analysis of average-reward Q-learning with asynchronous updates based on a single trajectory of Markovian samples.
Abstract: This work presents the first finite-time analysis of average-reward $Q$-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes that act as local clocks for each state-action pair. We show that the mean-square error of this $Q$-learning algorithm, measured in the span seminorm, converges at a rate of $\tilde{\mathcal O}(1/k)$. Technically, the use of adaptive stepsizes causes each $Q$-learning update to depend on the full sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates.
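To make the setup concrete, below is a minimal illustrative sketch of asynchronous average-reward Q-learning driven by a single trajectory, where the adaptive stepsize for each state-action pair is its own visit count ("local clock"). The environment interface (`env.reset()`, `env.step(s, a)`), the uniform behavior policy, the RVI-style reference offset `Q[ref_state, ref_action]`, and the exact `1/visit-count` schedule are assumptions for illustration; the abstract does not pin down these details of the algorithm analyzed in the paper.

```python
import numpy as np

def avg_reward_q_learning(env, num_states, num_actions, num_steps,
                          ref_state=0, ref_action=0, seed=0):
    """Hypothetical sketch: asynchronous average-reward Q-learning along a
    single Markovian trajectory, with local-clock adaptive stepsizes.

    The reference offset and the 1/visit-count stepsize are illustrative
    assumptions, not necessarily the paper's exact choices.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions), dtype=int)  # local clocks

    s = env.reset()
    for _ in range(num_steps):
        # Behavior policy: uniform exploration over actions (assumption).
        a = int(rng.integers(num_actions))
        s_next, r = env.step(s, a)

        # Adaptive stepsize: the local clock of (s, a), i.e. its visit count.
        # Because it depends on the whole visitation history, the update is
        # a non-Markovian stochastic approximation scheme.
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]

        # Relative (average-reward) Q-learning update: subtracting a reference
        # entry fixes the additive shift, which is why errors are naturally
        # measured in the span seminorm.
        td_target = r + Q[s_next].max() - Q[ref_state, ref_action]
        Q[s, a] += alpha * (td_target - Q[s, a])

        s = s_next
    return Q
```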
Submission Number: 1