Keywords: Multi-agent systems, Language Models, AI Safety
TL;DR: We study norm enforcement in multi-agent systems, find that misaligned LM agents abuse mechanisms to eliminate competitors, and design robust mechanisms that resist abuse.
Abstract: AI agents are increasingly deployed in shared environments where they pursue diverse goals and compete for rewards. This multi-agent competition can lead to behaviors that serve individual gains at collective cost. For instance, marketing agents competing for engagement on social media may post misleading content. Human societies address such problems through *norms* that constrain acceptable behavior, supported by *enforcement mechanisms* that detect and penalize violations. Motivated by this, we study norm enforcement mechanisms for language model agents. We find that simple enforcement mechanisms are exploited by misaligned agents for competitive advantage, even when they are not explicitly trained or prompted to do so. We thus turn our attention to designing more robust mechanisms, and identify two key ingredients: estimating each agent's reliability over time, and updating this estimate with escalating penalties for repeated misbehavior. Across three simulated environments and a variety of agent populations, mechanisms built on these principles resist exploitation, while still penalizing norm violations at comparable or lower cost than baselines. Our results position norm enforcement mechanisms as promising levers for shaping agents' behavior, but only when designed to anticipate becoming part of the environment they govern.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 163
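Illustrative sketch: The two ingredients named in the abstract (a per-agent reliability estimate tracked over time, and escalating penalties for repeated misbehavior) can be pictured with a small toy example. This is only a sketch under stated assumptions, not the mechanism evaluated in the submission: the class name `ReliabilityTracker`, the starting score of 1.0, the `base_penalty` and `growth` values, and the geometric escalation rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReliabilityTracker:
    """Toy per-agent reliability estimate with escalating penalties (hypothetical)."""

    base_penalty: float = 0.125  # assumed penalty for a first confirmed violation
    growth: float = 2.0          # assumed multiplier applied per repeated violation
    scores: dict = field(default_factory=dict)      # agent -> reliability in [0, 1]
    violations: dict = field(default_factory=dict)  # agent -> confirmed violations so far

    def reliability(self, agent: str) -> float:
        # Agents with no recorded history start fully trusted (an assumption).
        return self.scores.get(agent, 1.0)

    def record_violation(self, agent: str) -> float:
        # The penalty grows geometrically with the number of prior violations.
        count = self.violations.get(agent, 0)
        penalty = self.base_penalty * self.growth ** count
        # Lower the estimate, clamping at zero so it stays in [0, 1].
        self.scores[agent] = max(0.0, self.reliability(agent) - penalty)
        self.violations[agent] = count + 1
        return self.scores[agent]


tracker = ReliabilityTracker()
# Three confirmed violations by the same agent, each penalized more heavily.
history = [tracker.record_violation("agent_a") for _ in range(3)]
print(history)
```

In this toy version, each repeated violation costs twice as much as the previous one, so a persistent offender loses standing quickly while an agent with a single lapse keeps most of its reliability. How reliability should feed back into enforcement decisions is left open here, since the abstract does not specify it.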