Keywords: Large Language Model, Self-play, Multi-Agent System, Strategic Games
TL;DR: We propose a self-play framework for enhancing general multi-agent capabilities of LLMs via reinforcement learning on strategic games.
Abstract: Developing Large Language Models (LLMs) that cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce **MARSHAL**, an end-to-end RL framework that incentivizes **M**ulti-**A**gent **R**easoning through **S**elf-play wit**H** str**A**tegic **L**LMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction turn for credit assignment, and agent-specific advantage normalization to stabilize multi-agent training. By learning through self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with performance improvements of up to $28.7$\% on held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains for MASs on reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to $10.0$\% on AIME, $7.6$\% on GPQA-Diamond, and $3.5$\% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
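The abstract does not spell out the normalization formula, but a minimal sketch of what "agent-specific advantage normalization" could look like is given below: turn-level advantages are standardized using per-agent batch statistics rather than pooled statistics, so one agent's reward scale does not dominate the shared policy update. The function name, array layout, and epsilon are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def agent_specific_normalize(advantages, agent_ids, eps=1e-8):
    """Normalize turn-level advantages separately per agent.

    advantages: 1-D array with one entry per interaction turn in the batch.
    agent_ids:  parallel array naming which agent produced each turn.
    Using per-agent mean/std (instead of whole-batch statistics) is one way
    to stabilize multi-agent training when agents see different reward scales.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    agent_ids = np.asarray(agent_ids)
    normalized = np.empty_like(advantages)
    for agent in np.unique(agent_ids):
        mask = agent_ids == agent
        mu, sigma = advantages[mask].mean(), advantages[mask].std()
        normalized[mask] = (advantages[mask] - mu) / (sigma + eps)
    return normalized

# Example: two agents whose rewards live on very different scales.
adv = [4.0, 6.0, -0.1, 0.1, 5.0, 0.0]
ids = ["A", "A", "B", "B", "A", "B"]
print(agent_specific_normalize(adv, ids))
```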
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4715