**Keywords:** Reinforcement learning theory, Markov decision process, stochastic approximation

**TL;DR:** This paper provides a sharper finite-time analysis of double Q-learning, with an improved convergence rate in all major parameters.

**Abstract:** Double Q-learning (Hasselt, 2010) has achieved significant success in practice because it effectively mitigates the overestimation issue of Q-learning. However, the theoretical understanding of double Q-learning remains rather limited. The only existing finite-time analysis was recently established in (Xiong et al., 2020), where the polynomial learning rate adopted in the analysis typically yields a slower convergence rate. This paper tackles the more challenging case of a constant learning rate and develops new analytical tools that improve the existing convergence rate by orders of magnitude. Specifically, we show that synchronous double Q-learning attains an $\epsilon$-accurate global optimum with a time complexity of $\tilde{\Omega}\left(\frac{\ln D}{(1-\gamma)^7\epsilon^2} \right)$, and the asynchronous algorithm achieves a time complexity of $\tilde{\Omega}\left(\frac{L}{(1-\gamma)^7\epsilon^2} \right)$, where $D$ is the cardinality of the state-action space, $\gamma$ is the discount factor, and $L$ is a parameter related to the sampling strategy for asynchronous double Q-learning. These results improve the existing convergence rate by orders of magnitude in terms of its dependence on all major parameters $(\epsilon, 1-\gamma, D, L)$. This paper presents a substantial step toward a full understanding of the fast convergence of double Q-learning.

**Supplementary Material:** pdf

**Code Of Conduct:** I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.
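For readers unfamiliar with the setting, the following is a minimal sketch of synchronous tabular double Q-learning with a constant learning rate, in the spirit of Hasselt (2010). It is an illustration only, not the authors' code: the function name `sync_double_q`, the table layout of `P` and `R`, and the use of the transition table purely as a sampler (a generative model) are assumptions made for this example.

```python
import random

def sync_double_q(P, R, gamma, alpha, num_iters, seed=0):
    """Synchronous double Q-learning sketch (hypothetical helper).

    P[s][a] -- list of next-state probabilities (used only to sample s')
    R[s][a] -- deterministic reward (an assumption of this sketch)
    alpha   -- constant learning rate in (0, 1)
    """
    rng = random.Random(seed)
    n_s, n_a = len(P), len(P[0])
    states = list(range(n_s))
    QA = [[0.0] * n_a for _ in range(n_s)]
    QB = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(num_iters):
        # Pick one estimator to update with probability 1/2 each.
        upd, other = (QA, QB) if rng.random() < 0.5 else (QB, QA)
        new = [row[:] for row in upd]
        # Synchronous setting: every (s, a) pair is updated in each iteration.
        for s in range(n_s):
            for a in range(n_a):
                s2 = rng.choices(states, weights=P[s][a])[0]
                # The updated estimator selects the greedy action ...
                a_star = max(range(n_a), key=lambda b: upd[s2][b])
                # ... and the other estimator evaluates it.
                target = R[s][a] + gamma * other[s2][a_star]
                new[s][a] = (1 - alpha) * upd[s][a] + alpha * target
        for s in range(n_s):
            upd[s][:] = new[s]
    return QA, QB

# Toy usage: one state, two actions, self-loop, rewards 1 and 0.
QA, QB = sync_double_q(P=[[[1.0], [1.0]]], R=[[1.0, 0.0]],
                       gamma=0.5, alpha=0.1, num_iters=2000)
```

Decoupling action selection (`upd`) from action evaluation (`other`) is what distinguishes this update from vanilla Q-learning; the asynchronous variant mentioned in the abstract would instead update only the state-action pair visited along a single trajectory.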
