TL;DR: We show that self-play Q-learners can learn to collude, even when their initial states and policies are "always defect".
Abstract: A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
Lay Summary: As artificial intelligence (AI) becomes more common in business and online marketplaces, some experts worry that smart computer programs might start cooperating in ways that hurt consumers—like raising prices together without directly communicating. This paper looks into that concern by studying how AI agents can learn to work together, even when they’re supposed to be competing.
We focused on a classic decision-making scenario called the "prisoner's dilemma," which is often used to study cooperation. We found that when two AI agents repeatedly play this game and use a specific type of learning called "Q-learning," they can end up cooperating with each other. In fact, they often learn a cooperative strategy called "Pavlov" (or "win-stay, lose-shift"), where they stick with what works and change when it doesn’t.
The key finding is that under certain learning conditions, these AI agents can consistently learn to cooperate—even without being explicitly programmed to do so. This shows that AI systems can develop cooperative behavior on their own, which could have important consequences for markets and regulations.
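To make the setup concrete, here is a minimal, illustrative sketch (not the paper's exact algorithm or parameters) of two tabular Q-learners in self-play on the iterated prisoner's dilemma. The state is the last joint action, Q-values are initialized so that the greedy policy starts at "always defect", and the payoff values, learning rate, discount factor, and exploration rate are all assumed for illustration.

```python
import random

# Illustrative sketch only: two tabular Q-learners in self-play on the
# iterated prisoner's dilemma. State = previous round's joint action.
# Payoffs use a standard PD matrix (T=5, R=3, P=1, S=0) - assumed values,
# not necessarily those used in the paper.
ACTIONS = ["C", "D"]
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def train(episodes=50000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    states = [(a, b) for a in ACTIONS for b in ACTIONS]
    # Initialize Q so the greedy policy is "always defect" for both agents.
    Q = [{(s, a): (1.0 if a == "D" else 0.0) for s in states for a in ACTIONS}
         for _ in range(2)]
    s = ("D", "D")  # start from mutual defection
    for _ in range(episodes):
        acts = []
        for i in range(2):
            if rng.random() < eps:  # epsilon-greedy exploration
                acts.append(rng.choice(ACTIONS))
            else:
                acts.append(max(ACTIONS, key=lambda a: Q[i][(s, a)]))
        r = PAYOFF[(acts[0], acts[1])]
        s2 = (acts[0], acts[1])
        for i in range(2):  # standard Q-learning update for each agent
            best_next = max(Q[i][(s2, a)] for a in ACTIONS)
            Q[i][(s, acts[i])] += alpha * (r[i] + gamma * best_next
                                           - Q[i][(s, acts[i])])
        s = s2
    # Return each agent's greedy policy after training.
    return [{st: max(ACTIONS, key=lambda a: Q[i][(st, a)]) for st in states}
            for i in range(2)]

policies = train()
# The Pavlov (win-stay, lose-shift) policy cooperates iff last round's
# actions matched; one can compare the learned policies against it.
pavlov = {("C", "C"): "C", ("D", "D"): "C",
          ("C", "D"): "D", ("D", "C"): "D"}
print(policies[0])
```

Whether the learned greedy policies end up at Pavlov rather than "always defect" depends on the exploration rate, initialization, and run length; the paper characterizes the conditions under which this convergence provably occurs.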
Primary Area: Theory->Game Theory
Keywords: iterated prisoner's dilemma, cooperation, Q-learning, self-play
Submission Number: 12897