Learning Robust Social Strategies with Large Language Models

ICLR 2026 Conference Submission21932 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0

Keywords: Social Dilemmas, Opponent Shaping, MARL, LLMs

TL;DR: We propose a MARL algorithm that fine-tunes LLMs to cooperate in social dilemmas while remaining non-exploitable

Abstract: As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent games often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we build on an opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. Specifically, we derive a novel variant of Advantage Alignment under the assumption of non-observability of other players' actions on the current time step, resulting in $\textit{jit}$ Advantage Alignment. We further introduce a group-relative baseline that simplifies advantage computation, enabling multi-agent training at LLM scale. Agents fine-tuned with our method learn the well-known $\textit{tit-for-tat}$ strategy in the classic Iterated Prisoner's Dilemma. In complex environments, our method achieves higher collective payoffs while remaining robust against exploitation by greedy agents. Finally, we contribute a suite of social dilemma benchmarks to advance the study of cooperation in agentic AI.
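For readers unfamiliar with the tit-for-tat strategy mentioned in the abstract, the following is a minimal sketch of it in the Iterated Prisoner's Dilemma. It is illustrative only and is not the authors' method or environment: the payoff values (3/3, 0/5, 5/0, 1/1) are the conventional textbook ones, and the names `PAYOFFS`, `tit_for_tat`, `always_defect`, and `play` are hypothetical helpers introduced here.

```python
# Illustrative sketch only. Payoffs and helper names are assumptions,
# not taken from the paper.
C, D = "C", "D"

# Conventional Prisoner's Dilemma payoffs: (player A, player B).
PAYOFFS = {
    (C, C): (3, 3),  # mutual cooperation
    (C, D): (0, 5),  # A is exploited
    (D, C): (5, 0),  # A exploits
    (D, D): (1, 1),  # mutual defection
}


def tit_for_tat(opponent_history):
    """Cooperate on the first round, then copy the opponent's previous move."""
    return C if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """A greedy baseline that defects on every round."""
    return D


def play(strategy_a, strategy_b, rounds=10):
    """Play a fixed number of rounds and return the total payoffs (A, B)."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        # Each strategy only sees the other player's past moves.
        move_a = strategy_a(hist_b)
        move_b = strategy_b(hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b
```

Under these assumed payoffs, two tit-for-tat players cooperate throughout, while tit-for-tat against a constant defector is exploited only in the first round and defects afterwards. This is the sense in which the strategy is both cooperative and hard to exploit.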

Primary Area: foundation or frontier models, including LLMs

Submission Number: 21932
