A Perfect Collusion Benchmark: How can AI agents be prevented from colluding with information-theoretic undetectability?

Published: 31 Oct 2023, Last Modified: 29 Nov 2023. Venue: MASEC@NeurIPS'23 (Work in Progress)
Keywords: AI Safety, Collusion, Information Hiding, Steganography, LLMs, GenAI, Reinforcement Learning
TL;DR: New benchmark for studying perfectly undetectable collusion between generative AI models.
Abstract: Secret collusion among advanced AI agents is widely considered a significant risk to AI safety. In this paper, we investigate whether LLM agents can learn to collude undetectably by hiding secret messages inside their overt communications. To this end, we implement a variant of Simmons' prisoners' problem using LLM agents and turn it into a stegosystem by leveraging recent advances in perfectly secure steganography. We propose our resulting benchmark environment for investigating how easily LLM agents can learn to use perfectly secure steganography tools, and how secret collusion between agents can be countered pre-emptively through paraphrasing attacks on communication channels. Our work yields unprecedented empirical insight into whether advanced AI agents may be able to collude unnoticed.
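To make the setup concrete, below is a minimal toy sketch of the prisoners' problem and a paraphrasing defence. This is a hypothetical illustration, not the paper's actual stegosystem: the sender hides one bit per message by choosing between two semantically equivalent covertexts, and a warden who paraphrases the channel destroys the hidden bit without ever needing to detect it. Note this toy scheme is only perfectly secure if the secret bits are uniformly random and the two covertext variants are equiprobable under the cover distribution; the paper's approach relies on perfectly secure steganography techniques rather than this fixed two-variant trick.

```python
# Toy sketch of Simmons' prisoners' problem (hypothetical illustration).
# Alice embeds one secret bit by choosing between two semantically
# equivalent covertexts; Bob decodes by checking which variant arrived.

COVERTEXTS = {
    0: "The weather is nice today.",
    1: "Today the weather is nice.",
}

def alice_encode(secret_bit: int) -> str:
    """Embed one bit by picking the matching covertext variant."""
    return COVERTEXTS[secret_bit]

def bob_decode(message: str):
    """Recover the bit if the message matches a known variant, else None."""
    for bit, text in COVERTEXTS.items():
        if message == text:
            return bit
    return None  # encoding was paraphrased away

def warden_paraphrase(message: str) -> str:
    """Semantics-preserving rewrite that scrubs the hidden encoding."""
    return "It is a nice day, weather-wise."

msg = alice_encode(1)
assert bob_decode(msg) == 1                         # channel intact: bit recovered
assert bob_decode(warden_paraphrase(msg)) is None   # bit destroyed by paraphrasing
```

The design choice illustrated here is the asymmetry the abstract points to: a paraphrasing warden needs no steganalysis capability at all, which is why paraphrasing attacks are a candidate pre-emptive countermeasure against collusion.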
Submission Number: 23