A Perfect Collusion Benchmark: How can AI agents be prevented from colluding with information-theoretic undetectability?

Published: 31 Oct 2023, Last Modified: 29 Nov 2023. Venue: MASEC@NeurIPS'23 (Work in Progress)
Keywords: AI Safety, Collusion, Information Hiding, Steganography, LLMs, GenAI, Reinforcement Learning
TL;DR: New benchmark for studying perfectly undetectable collusion between generative AI models.
Abstract: Secret collusion among advanced AI agents is widely considered a significant risk to AI safety. In this paper, we investigate whether LLM agents can learn to collude undetectably by hiding secret messages inside their overt communications. To this end, we implement a variant of Simmons' prisoners' problem using LLM agents and turn it into a stegosystem by leveraging recent advances in perfectly secure steganography. We propose our resulting benchmark environment for investigating how easily LLM agents can learn to use perfectly secure steganography tools, and how secret collusion between agents can be countered pre-emptively through paraphrasing attacks on communication channels. Our work yields unprecedented empirical insight into whether advanced AI agents may be able to collude unnoticed.
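To make the setup concrete, below is a minimal toy sketch of the prisoners' problem and a paraphrasing defence. This is a hypothetical illustration, not the paper's actual stegosystem: the sender hides one bit per message by choosing between two semantically equivalent covertexts, and a warden who paraphrases the channel destroys the hidden bit without ever needing to detect it. Note this toy scheme is only perfectly secure if the secret bits are uniformly random and the two covertext variants are equiprobable under the cover distribution; the paper's approach relies on perfectly secure steganography techniques rather than this fixed two-variant trick.

```python
# Toy sketch of Simmons' prisoners' problem (hypothetical illustration).
# Alice embeds one secret bit by choosing between two semantically
# equivalent covertexts; Bob decodes by checking which variant arrived.

COVERTEXTS = {
    0: "The weather is nice today.",
    1: "Today the weather is nice.",
}

def alice_encode(secret_bit: int) -> str:
    """Embed one bit by picking the matching covertext variant."""
    return COVERTEXTS[secret_bit]

def bob_decode(message: str):
    """Recover the bit if the message matches a known variant, else None."""
    for bit, text in COVERTEXTS.items():
        if message == text:
            return bit
    return None  # encoding was paraphrased away

def warden_paraphrase(message: str) -> str:
    """Semantics-preserving rewrite that scrubs the hidden encoding."""
    return "It is a nice day, weather-wise."

msg = alice_encode(1)
assert bob_decode(msg) == 1                         # channel intact: bit recovered
assert bob_decode(warden_paraphrase(msg)) is None   # bit destroyed by paraphrasing
```

The design choice illustrated here is the asymmetry the abstract points to: a paraphrasing warden needs no steganalysis capability at all, which is why paraphrasing attacks are a candidate pre-emptive countermeasure against collusion.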
Submission Number: 23