Keywords: open-ended learning, self-play, quality-diversity, foundation models, policy search
TL;DR: This work proposes Foundation-Model Self-Play as a new research direction, merging self-play with foundation models for open-ended strategy discovery in multi-agent games.
Abstract: Multi-agent interactions have long fueled innovation, from natural predator-prey dynamics to the space race. Self-play (SP) algorithms try to harness these dynamics by pitting agents against ever-improving opponents, thereby creating an implicit curriculum. However, SP often fails to produce diverse solutions and can get stuck in locally optimal behaviors. We introduce Foundation-Model Self-Play (FMSP), a new direction that leverages the code-generation capabilities and vast knowledge of foundation models (FMs) to overcome these challenges by leaping across optima in policy space. We propose a family of approaches: (1) Vanilla FMSP (vFMSP) continually refines and improves an agent’s policy via competitive self-play; (2) Novelty-Search Self-Play (NSSP) builds a diverse population of strategies, ignoring performance; and (3) the most promising variant, Quality-Diversity Self-Play (QDSP), creates a diverse set of high-quality policies by combining elements of both NSSP and vFMSP. We evaluate FMSPs in a continuous-control pursuer-evader setting (Car Tag) and in “Gandalf,” a simple AI safety simulation in which an attacker tries to jailbreak an LLM’s defenses. In Car Tag, our algorithms explore a wide variety of methods, including reinforcement learning, tree search, and heuristics. In terms of discovered policy quality, QDSP and vFMSP find policies that surpass strong human-designed strategies. In Gandalf, our algorithms automatically red-team an LLM, successfully jailbreaking six different, progressively stronger levels of defense. Furthermore, FMSPs enable us to automatically close the loop and rapidly patch the discovered vulnerabilities. Overall, FMSP and its many possible variants represent a promising new research frontier for improving self-play with foundation models, opening fresh paths toward more creative and open-ended strategy discovery.
Submission Number: 26