Agent Properties for Multi-Agent Safety

Published: 01 Mar 2026, Last Modified: 24 Apr 2026 · ICLR 2026 AIWILD · CC BY 4.0
Keywords: multi-agent safety, cooperative AI, agent evaluations
TL;DR: We argue that evaluating specific agent properties, rather than simulating canonical cooperation problems, offers a more tractable path toward actionable results for multi-agent AI safety
Abstract: Cooperation failures in multi-agent interactions could lead to catastrophic outcomes even among aligned AI agents. Classic cooperation problems such as the Prisoner's Dilemma or the Tragedy of the Commons have been useful for illustrating and exploring this challenge, but toy experiments with current language models cannot provide robust evidence for how advanced agents will behave in real-world settings. To better understand how to prevent cooperation failures among AI agents, we propose a shift in focus from simulating canonical scenarios from game theory to studying specific agent properties. These should include both individual properties observable in isolation and interactive properties that only manifest in relation to other agents. If we can (1) evaluate to what extent relevant properties are present in agents and (2) understand how those properties influence outcomes in multi-agent interactions, then we have a path towards actionable results that could inform agent design and regulation.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 126