Language Models can Learn High-Capacity Secure Steganography

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, AI4GOOD Workshop 2026 Regular, CC BY 4.0
Keywords: steganography, language models, AI safety, model organism
TL;DR: We show that language model organisms can be trained to implement perfectly undetectable steganographic schemes
Abstract: Steganography hides messages inside innocuous output so that an observer cannot tell a secret is being sent, and it has become a concrete safety concern for language models deployed under monitoring. The worst case for a defender is a model whose output looks statistically identical to its normal generation, yet hides as much secret information per token as information theory allows. Whether a language model can be trained to do this, and what signatures such a model would leave for monitors, have not been characterized. We resolve the feasibility question by construction with MEC-LLM: a few trainable transformer layers appended to a frozen LLM and trained end-to-end against the objective of minimum-entropy coupling (MEC), the maximum-capacity scheme among perfectly undetectable stegosystems. The trained model encodes uniformly random 128-bit secrets at roughly 1.8 bits per token, and a classifier over text features reaches only AUC 0.62 when separating its output from cover text. Internalization is not free for the attacker: a classifier on the model's full output logits separates stego from cover at AUC 1.00. We also show that paraphrasing the stegotext with Claude Haiku 4.5 destroys the channel: across 13 paraphrase prompts spanning minimal edits to structural rewrites, bit-error rate rises from 12.8% to 48.7%. Finally, we introduce a complete model organism that composes MEC-LLM with a trained payload generator to exfiltrate secrets planted in the system prompt, giving AI safety research a controlled target for studying internalized covert channels. We release this model organism for future work on monitoring and interpretability of covert communication in language models.
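For readers unfamiliar with the coupling idea behind MEC, the following is a minimal sketch and not the MEC-LLM training procedure. It uses a simple greedy heuristic (exact minimum-entropy coupling is hard in general) to build a joint distribution whose marginals are a uniform message distribution and a cover token distribution. The toy distributions and the function names `greedy_coupling` and `entropy` are assumptions made for illustration only.

```python
import math

def greedy_coupling(p, q, tol=1e-12):
    """Greedy heuristic for a low-entropy coupling of marginals p and q.

    Repeatedly pairs the largest remaining mass in p with the largest
    remaining mass in q. This approximates, but does not guarantee,
    the minimum-entropy coupling.
    """
    p, q = list(p), list(q)
    joint = {}
    while True:
        i = max(range(len(p)), key=p.__getitem__)
        j = max(range(len(q)), key=q.__getitem__)
        mass = min(p[i], q[j])
        if mass <= tol:
            break
        joint[(i, j)] = joint.get((i, j), 0.0) + mass
        p[i] -= mass  # at least one of these two entries becomes exactly 0
        q[j] -= mass
    return joint

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in probs if x > 0)

# Toy example (assumed values): a uniform 2-bit message and a 3-token cover distribution.
message_dist = [0.25, 0.25, 0.25, 0.25]
token_dist = [0.5, 0.3, 0.2]
joint = greedy_coupling(message_dist, token_dist)

# The sender draws a token from P(token | message). Because the token marginal
# of the joint equals token_dist, the emitted token is distributed like cover.
token_marginal = [sum(m for (_, j), m in joint.items() if j == t)
                  for t in range(len(token_dist))]
print(token_marginal, entropy(joint.values()))
```

A lower joint entropy means the token carries more information about the message, which is why a minimum-entropy coupling is associated with maximum capacity under the constraint that the output distribution matches cover.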
Submission Number: 452