Keywords: Social Dilemmas, Opponent Shaping, MARL, LLMs
TL;DR: We propose a MARL algorithm for fine-tuning LLMs to cooperate in social dilemmas while remaining non-exploitable.
Abstract: As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways.
These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare.
While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent games often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models.
To address this tendency of RL to converge to poor equilibria, we build on an opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability.
Specifically, we derive a novel variant of Advantage Alignment under the assumption that other players' actions at the current time step are unobservable, resulting in $\textit{jit}$ Advantage Alignment.
We further introduce a group-relative baseline that simplifies advantage computation, enabling multi-agent training at LLM scale.
Agents fine-tuned with our method learn the well-known $\textit{tit-for-tat}$ strategy in the classic Iterated Prisoner's Dilemma.
In complex environments, our method achieves higher collective payoffs while remaining robust against exploitation by greedy agents.
Finally, we contribute a suite of social dilemma benchmarks to advance the study of cooperation in agentic AI.
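For readers unfamiliar with the tit-for-tat strategy mentioned in the abstract, the following is a minimal sketch of how it behaves in the Iterated Prisoner's Dilemma. It is not the paper's training code: the payoff values are the common textbook choice (an assumption, not taken from the submission), and the names `PAYOFFS`, `tit_for_tat`, `always_defect`, and `play` are hypothetical helpers introduced only for illustration.

```python
# Assumed textbook payoffs (row player, column player); not from the paper.
# "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(opponent_history):
    """Cooperate on the first round, then copy the opponent's previous action."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """A greedy baseline that defects regardless of history."""
    return "D"


def play(policy_a, policy_b, rounds=10):
    """Run the iterated game and return the total payoff of each player."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        # Each policy only sees the opponent's past actions, not the current one.
        action_a = policy_a(hist_b)
        action_b = policy_b(hist_a)
        reward_a, reward_b = PAYOFFS[(action_a, action_b)]
        total_a += reward_a
        total_b += reward_b
        hist_a.append(action_a)
        hist_b.append(action_b)
    return total_a, total_b


if __name__ == "__main__":
    print("TFT vs TFT:", play(tit_for_tat, tit_for_tat))
    print("TFT vs always-defect:", play(tit_for_tat, always_defect))
    print("always-defect vs always-defect:", play(always_defect, always_defect))
```

Under these assumed payoffs, two tit-for-tat players sustain mutual cooperation, while a tit-for-tat player facing a pure defector is exploited only in the first round and matches defection afterwards. This is the sense in which the strategy is both cooperative and hard to exploit.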
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21932