Multi-Agent Off-Policy TDC with Near-Optimal Sample and Communication Complexities

03 Apr 2022, 16:30 (modified: 30 Jun 2022, 11:19)Accepted by TMLREveryoneRevisionsBibTeX
Abstract: The finite-time convergence of off-policy temporal difference (TD) learning has been comprehensively studied recently. However, such a type of convergence has not been established for off-policy TD learning in the multi-agent setting, which covers broader reinforcement learning applications and is fundamentally more challenging. This work develops a decentralized TD with correction (TDC) algorithm for multi-agent off-policy TD learning under Markovian sampling. In particular, our algorithm avoids sharing the actions, policies and rewards of the agents, and adopts mini-batch sampling to reduce the sampling variance and communication frequency. Under Markovian sampling and linear function approximation, we proved that the finite-time sample complexity of our algorithm for achieving an $\epsilon$-accurate solution is in the order of $\mathcal{O}\big(\frac{M\ln\epsilon^{-1}}{\epsilon(1-\sigma_2)^2}\big)$, where $M$ denotes the total number of agents and $\sigma_2$ is a network parameter. This matches the sample complexity of the centralized TDC. Moreover, our algorithm achieves the optimal communication complexity $\mathcal{O}\big(\frac{\sqrt{M}\ln\epsilon^{-1}}{1-\sigma_2}\big)$ for synchronizing the value function parameters, which is order-wise lower than the communication complexity of the existing decentralized TD(0). Numerical simulations corroborate our theoretical findings.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Michael_Bowling1
Submission Number: 23