Secret Collusion among AI Agents: Multi-Agent Deception via Steganography

Sumeet Ramesh Motwani; Mikhail Baranchuk; Martin Strohmeier; Vijay Bolina; Philip Torr; Lewis Hammond; Christian Schroeder de Witt

Published: 25 Sept 2024, Last Modified: 06 Nov 2024. Venue: NeurIPS 2024 poster. License: CC BY 4.0

Keywords: Collusion, AI Safety, Steganography, Large Language Models, Model Evaluation Framework, Multi-Agent Security, Security, Frontier Models, GenAI, AI Control

TL;DR: A comprehensive formalization of steganographic collusion in decentralised systems of generative AI agents and a model evaluation framework

Abstract: Recent advancements in generative AI suggest the potential for large-scale interaction between autonomous agents and humans across platforms such as the internet. While such interactions could foster productive cooperation, the ability of AI agents to circumvent security oversight raises critical multi-agent security problems, particularly in the form of unintended information sharing or undesirable coordination. In our work, we establish the subfield of secret collusion, a form of multi-agent deception, in which two or more agents employ steganographic methods to conceal the true nature of their interactions, be it communicative or otherwise, from oversight. We propose a formal threat model for AI agents communicating steganographically and derive rigorous theoretical insights about the capacity and incentives of large language models (LLMs) to perform secret collusion, in addition to the limitations of threat mitigation measures. We complement our findings with empirical evaluations demonstrating rising steganographic capabilities in frontier single and multi-agent LLM setups and examining potential scenarios where collusion may emerge, revealing limitations in countermeasures such as monitoring, paraphrasing, and parameter optimization. Our work is the first to formalize and investigate secret collusion among frontier foundation models, identifying it as a critical area in AI Safety and outlining a comprehensive research agenda to mitigate future risks of collusion between generative AI systems.

Primary Area: Safety in machine learning

Submission Number: 7633
