Keywords: steganography, language models, AI safety, hand and brain protocol
TL;DR: We define agentic steganography and demonstrate four model organisms that can implement it, along with a defense protocol
Abstract: The standard defense against steganography in language models is to paraphrase the model's output, scrambling any covert signal hidden in its phrasing. We show that this defense fails by construction against a class of attack we call $\textit{agentic steganography}$, in which payloads are encoded not in $\textit{how}$ a model writes but in $\textit{what}$ it chooses to do: which solution strategy to pursue, which algorithm to implement, or which story to write. Such payloads sit in the semantic component of the response entropy, and any paraphraser that preserves meaning leaves them untouched.
We then study four agentic steganography model organisms which we call $\textbf{Math}$, $\textbf{Perspective}$, $\textbf{Code}$, and $\textbf{Writing}$. Math is a synthetic model over MATH Level-5 problems on which we can measure channel capacity exactly. The other three are language models based on Qwen-7B-Instruct or Qwen-2.5-Coder-7B-Instruct, which we LoRA finetune to encode payloads in summarization perspective, in algorithmic strategy choice in Python, and in narrative-mode choice in short stories. Paraphrasing fails to remove steganographic encoding in every case where we can measure it.
To remove the steganography, we introduce the $\textit{hand and brain protocol}$. The defender deploys a pair of models: a small $\textit{trusted hand}$ that the deployment guarantees has not seen the adversary's payload, and a much stronger $\textit{untrusted brain}$ that we cannot trust the same way. At each decision point in the brain's reasoning, the hand writes the first $k$ tokens of the next segment, and the brain continues from that prefix. Because the hand's output cannot depend on the payload, the prefix overrides the brain's choice of strategy and overwrites the encoded bit before the transcript leaves the system. $\textbf{Math}$, where the codebook is known, the protocol cuts channel capacity from $2.13$ to $0.08$ bits per decision point at $0.96$ task accuracy, robust across four hand models and five clustering thresholds. On the trained model organisms, we achieve positive bit error rate with a minor reduction in task accuracy. We release the four model organisms and the defense protocol.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 406
Loading