Explaining Metastable Cooperation in Independent Multi-Agent Boltzmann Q-Learning – A Deterministic Approximation

19 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Temporal difference learning, complex dynamics, frequency-adjusted Q-learning
Abstract: Multi-agent reinforcement learning involves interacting agents whose learning processes are coupled through a shared environment. This work introduces a new discrete-time approximation model for multi-agent Boltzmann Q-learning that accounts for agents' update frequencies. We demonstrate why previous models fail to accurately represent the actual stochastic learning dynamics, whereas our model reproduces several complex emergent dynamical regimes, including transient cooperation and metastable states in social dilemmas such as the Prisoner's Dilemma. We show that increasing the discount factor can prevent convergence by inducing oscillations through a supercritical Neimark-Sacker bifurcation, which transforms the unique stable fixed point into a stable limit cycle. This analysis provides a deeper understanding of the complexities of multi-agent learning dynamics and the conditions under which convergence and cooperation may fail to be achieved.
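To make the setting concrete, below is a minimal Python sketch of the kind of deterministic, frequency-adjusted approximation the abstract describes: two independent Boltzmann Q-learners in the Prisoner's Dilemma, iterated as an expected-update map. This is an illustration only, not the paper's exact model; the payoff matrix, learning rate `alpha`, temperature `tau`, initialization, and the precise form of the update are all assumptions made for the example.

```python
# Minimal sketch (not the paper's exact model): a deterministic,
# frequency-adjusted approximation of two independent Boltzmann
# Q-learners in the Prisoner's Dilemma. All payoffs and
# hyperparameters below are illustrative assumptions.
import numpy as np

# Row player's payoffs; the game is symmetric, so the column player's
# payoffs are the transpose. Actions: 0 = Cooperate, 1 = Defect.
R = np.array([[3.0, 0.0],
              [5.0, 1.0]])

alpha = 0.1   # learning rate (assumed)
gamma = 0.9   # discount factor; the abstract's bifurcation parameter
tau   = 1.0   # Boltzmann temperature (assumed)

def boltzmann(q, tau):
    """Softmax policy over Q-values at temperature tau."""
    z = np.exp((q - q.max()) / tau)   # shift for numerical stability
    return z / z.sum()

def step(q1, q2):
    """One iteration of the deterministic expected-update map.

    With frequency adjustment, every action's Q-value moves toward its
    expected TD target each step, regardless of how often that action
    would actually be sampled under the current policy.
    """
    x1, x2 = boltzmann(q1, tau), boltzmann(q2, tau)
    # Expected immediate reward of each action against the opponent's policy.
    r1 = R @ x2          # agent 1 is the row player
    r2 = R @ x1          # symmetric game: agent 2 mirrors agent 1
    q1_next = q1 + alpha * (r1 + gamma * q1.max() - q1)
    q2_next = q2 + alpha * (r2 + gamma * q2.max() - q2)
    return q1_next, q2_next

# Iterate the map and watch the cooperation probabilities. For small
# gamma the trajectory settles on a fixed point; the abstract reports
# that raising gamma can instead produce persistent oscillations
# (a Neimark-Sacker bifurcation to a stable limit cycle).
q1 = np.array([0.5, 0.0])   # slight initial bias toward cooperation (assumed)
q2 = np.array([0.5, 0.0])
for t in range(5000):
    q1, q2 = step(q1, q2)
    if t % 1000 == 0:
        print(f"t={t:5d}  P(cooperate) = {boltzmann(q1, tau)[0]:.3f}, "
              f"{boltzmann(q2, tau)[0]:.3f}")
```

Sweeping `gamma` in such a map is one way to probe the convergence boundary the abstract refers to, with oscillatory Q-value trajectories signaling the post-bifurcation limit-cycle regime.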
Primary Area: reinforcement learning
Submission Number: 18134