Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning

Published: 16 Jan 2024, Last Modified: 10 Apr 2024, ICLR 2024 poster
Keywords: model-free RL, self-play, memory efficiency, Q-learning, Nash equilibrium, Markov policy
TL;DR: We design an algorithm that is memory-, computation-, and sample-efficient, has a low burn-in cost, and outputs a Markov and Nash policy for two-player zero-sum Markov games.
Abstract: The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents make decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, heavy dependence of the sample complexity on the long horizon and the large state space, high computational complexity, non-Markov policies, non-Nash policies, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, \emph{Memory-Efficient Nash Q-Learning (ME-Nash-QL)}, for two-player zero-sum Markov games, a specific setting of MARL. We prove that ME-Nash-QL can output an $\varepsilon$-approximate Nash policy with space complexity $O(SABH)$, sample complexity $\widetilde{O}(H^4SAB/\varepsilon^2)$, and computational complexity $O(T\,\mathrm{poly}(AB))$, where $S$ is the number of states, $A$ and $B$ are the numbers of actions for the two players, $H$ is the horizon length, and $T$ is the number of samples. Notably, our approach improves on the space complexity of existing algorithms for tabular cases. It achieves the lowest computational complexity while preserving Markov policies. Furthermore, our algorithm outputs a Nash policy and achieves the best sample complexity among existing guarantees for long horizons, i.e., when $\min\{A, B\} \ll H^2$. Our algorithm also achieves the best burn-in cost, $O(SAB\,\mathrm{poly}(H))$, whereas previous algorithms need at least $O(S^3 AB\,\mathrm{poly}(H))$ to attain the same level of sample complexity as ours.
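To make the scaling in the abstract concrete, the following minimal sketch evaluates the stated space and sample bounds with constants and logarithmic factors dropped. The function names and the problem sizes are hypothetical and do not come from the paper; the numbers are order-of-magnitude illustrations only, not guarantees.

```python
def q_table_entries(S: int, A: int, B: int, H: int) -> int:
    """Entry count matching the O(SABH) space bound: one value per
    (step, state, action pair). Constants are ignored."""
    return S * A * B * H


def sample_bound(S: int, A: int, B: int, H: int, eps: float) -> float:
    """H^4 * S * A * B / eps^2, i.e. the stated sample complexity
    with constants and log factors dropped."""
    return (H ** 4) * S * A * B / eps ** 2


def is_long_horizon(A: int, B: int, H: int) -> bool:
    """Crude stand-in for the regime min{A, B} << H^2 named in the
    abstract; a strict inequality is an assumption made here."""
    return min(A, B) < H ** 2


# Hypothetical problem sizes, chosen only for illustration.
S, A, B, H, eps = 100, 5, 4, 10, 0.1

print("table entries:", q_table_entries(S, A, B, H))
print("sample bound (no constants/logs):", sample_bound(S, A, B, H, eps))
print("long-horizon regime:", is_long_horizon(A, B, H))
```

Doubling $H$ in this sketch doubles the table size but multiplies the sample bound by 16, which is why the dependence on the horizon matters in the comparison above.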
Primary Area: learning theory
Submission Number: 4956