Track: Technical
Keywords: Steganography, Large Language Models, Reinforcement Learning, In-Context Learning
TL;DR: We show that when optimization pressure is applied to certain LLMs, steganography can emerge even without specific prompting.
Abstract: Future AI systems may involve multiple AI agents with independent and potentially adversarial goals interacting with one another. In these settings, there is the risk that agents will learn to collude in order to increase their gains at the expense of other agents, and steganographic techniques are a powerful way to achieve such collusion undetected. Steganography is defined as the practice of concealing information within another message or physical object to communicate with a colluding party while avoiding detection by a third party. In this paper, we use a simplified candidate screening setting with two Large Language Models (LLMs). Here, a cover letter summarizing LLM has access to sensitive information that has historically been correlated with good candidates, but that it is not allowed to communicate to the decision-making LLM. We use two learning algorithms to optimize the LLMs to improve their performance on the candidate screening task -- In-Context Reinforcement Learning (ICRL) and Gradient-Based Reinforcement Learning (GBRL). We find that even though we do not directly prompt the models to do steganography, it emerges because it is instrumental for obtaining reward.
Submission Number: 62
Loading