Abstract: Non-stationarity is a fundamental challenge in multi-agent reinforcement learning (MARL), where agents update their behaviour as they learn. In multi-agent settings, individual agents may have an incomplete view of the actions of others, which can complicate the learning process. Many theoretical advances in MARL avoid the challenge of non-stationarity by coordinating the policy updates of agents in various ways, including by synchronizing the times at which agents are allowed to revise their policies. In this paper, we study an asynchronous variant of the decentralized Q-learning algorithm, a recent MARL algorithm for stochastic games. We provide sufficient conditions under which the asynchronous algorithm drives play to equilibrium with high probability. In this generalization, players need not agree on a common schedule of policy update times and may revise their policies at their own independently selected times. This work extends the applicability of the decentralized Q-learning algorithm to settings in which parameters are selected in an independent manner, and tames non-stationarity without imposing the coordination assumptions of prior work.
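To illustrate the kind of asynchrony discussed above, the following is a minimal conceptual sketch, not the paper's exact algorithm or analysis: each agent runs ordinary Q-learning on its own local observations and revises its policy (with inertia) only at revision times it draws independently of the other agents. The toy stochastic game, parameter names, and schedule construction below are illustrative assumptions.

```python
# Conceptual sketch of asynchronous, decentralized policy revision in a toy
# two-agent stochastic game. All names and parameters here are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, n_agents = 2, 2, 2
gamma, alpha, eps, inertia = 0.9, 0.1, 0.1, 0.8

# Random toy game: rewards and transitions depend on the state and joint action.
rewards = rng.uniform(size=(n_agents, n_states, n_actions, n_actions))
transitions = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

# Each agent keeps a Q-table over (state, own action) and a deterministic policy.
Q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]
policy = [rng.integers(n_actions, size=n_states) for _ in range(n_agents)]

# Asynchrony: each agent draws its own policy-revision times independently,
# so update schedules need not coincide across agents.
revision_times = [set(np.cumsum(rng.integers(50, 150, size=200)).tolist())
                  for _ in range(n_agents)]

s = 0
for t in range(10_000):
    # Each agent acts from its current policy with epsilon-greedy exploration.
    a = [policy[i][s] if rng.random() > eps else rng.integers(n_actions)
         for i in range(n_agents)]
    s_next = rng.choice(n_states, p=transitions[s, a[0], a[1]])
    for i in range(n_agents):
        # Decentralized Q-update: agent i uses only the state, its own action,
        # and its realized reward; it never observes the other agent's action.
        r = rewards[i, s, a[0], a[1]]
        Q[i][s, a[i]] += alpha * (r + gamma * Q[i][s_next].max() - Q[i][s, a[i]])
        # At agent i's own revision times, best-reply to its Q-estimates,
        # keeping the current policy with some inertia probability.
        if t in revision_times[i] and rng.random() > inertia:
            policy[i] = Q[i].argmax(axis=1)
    s = s_next

print("final policies:", [p.tolist() for p in policy])
```

The point of the sketch is only the scheduling structure: because `revision_times` is drawn per agent, no coordination of policy-update times is assumed, which is the setting the paper's sufficient conditions are meant to cover.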