Keywords: Social Dilemmas, Opponent Shaping, MARL, LLMs
TL;DR: We propose an MARL algorithm that fine-tunes LLMs to cooperate in social dilemmas while remaining non-exploitable.
Abstract: As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways.
These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare.
While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies.
We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models.
To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability.
We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale.
We also contribute a novel social dilemma environment, Trust-and-Split, which requires natural language communication to achieve high collective welfare.
Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21932