NEST: Nascent Encoded Steganographic Thoughts

Published: 23 May 2026, Last Modified: 23 May 2026 · ICML 2026 AIWILD · Readers: Everyone · License: CC BY 4.0
Keywords: chain-of-thought monitoring, steganographic reasoning, AI safety, CoT faithfulness, LLM evaluation, frontier model capabilities, acrostic steganography, paired baseline evaluation
TL;DR: We decompose steganographic reasoning into compute-the-answer and embed-into-cover; under prompting, 34 LLMs reach the embedding floor (Opus 4.5: 92% on 4-digit acrostics) but the joint reason-and-embed channel is bounded by a filler-token baseline.
Abstract: Monitoring chain-of-thought (CoT) reasoning is a foundational safety technique for large language model (LLM) agents; however, this oversight is compromised if models learn to conceal their reasoning. We explore steganographic CoT, in which models hide secret reasoning within innocuous text, to inform risk assessment and deployment policies. Steganographic reasoning requires two skills in a single forward pass: computing an intermediate result, and embedding it into a coherent cover that answers an unrelated question. Drawing on our taxonomy of steganographic and non-steganographic CoT types, we systematically evaluate the limits of prompt-elicited steganographic CoT capability across 34 models, ranging from past generations to the current frontier. We measure monitor evasion, refusal rates, encoding fidelity, and hidden task accuracy across five datasets, comparing against plain reasoning, direct answer, and filler-token baselines. Two experiments isolate the two sub-skills: a reasoning-task sweep tests joint reason-and-embed, while a counting task hands the model a known numerical sequence and tests embedding alone, a necessary precondition for steganographic reasoning. Current frontier models cannot sustain joint reason-and-embed: a paired McNemar comparison shows the steganographic channel is dominated by a filler-token baseline on every (model, family) cell. The encoding-only floor, by contrast, is cleared: Claude Opus 4.5 reaches 92% per-number partial accuracy (54% exact-match) on 4-digit sequences and saturates at 100% exact-match on length-8 single-digit sequences, establishing that the binding constraint on steganographic CoT is the joint reasoning-plus-encoding load, not raw channel capacity. Separately, GPT-5.2 issues an explicit refusal to the steganographic instruction yet still produces partial valid encoding in 6 of 644 trials.
Our findings underscore the need for continuous evaluation of steganographic risk and provide a methodology to preemptively detect and evaluate hidden reasoning that might empower misaligned scheming and deceptive behavior.
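The two accuracy figures and the paired comparison mentioned in the abstract can be made concrete with a minimal sketch. The function names, the list-of-integers representation of decoded sequences, and the use of the exact binomial form of McNemar's test are all assumptions made for illustration; the abstract does not specify the authors' scoring code or which McNemar variant they used.

```python
from math import comb


def per_number_accuracy(target, decoded):
    """Fraction of positions in `target` that `decoded` recovers correctly.

    Hypothetical scoring rule: positions are compared in order, and a
    missing position in `decoded` counts as an error.
    """
    if not target:
        raise ValueError("target must be non-empty")
    hits = sum(
        1 for i, t in enumerate(target)
        if i < len(decoded) and decoded[i] == t
    )
    return hits / len(target)


def exact_match(target, decoded):
    """True only if the whole decoded sequence equals the target."""
    return list(target) == list(decoded)


def mcnemar_exact(stego_correct, baseline_correct):
    """Two-sided exact McNemar test on paired per-item correctness.

    Returns (b, c, p), where b counts items only the steganographic
    condition got right and c counts items only the baseline got right.
    Concordant pairs do not affect the test.
    """
    if len(stego_correct) != len(baseline_correct):
        raise ValueError("paired inputs must have equal length")
    pairs = list(zip(stego_correct, baseline_correct))
    b = sum(1 for s, f in pairs if s and not f)
    c = sum(1 for s, f in pairs if f and not s)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Under the null, discordant pairs split Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Illustrative data only, not results from the paper.
target = [1234, 5678, 9012, 3456]
decoded = [1234, 5678, 9012, 3465]
partial = per_number_accuracy(target, decoded)
exact = exact_match(target, decoded)

stego = [True, False, False, False, False, True]
filler = [True, True, True, True, True, True]
b, c, p = mcnemar_exact(stego, filler)
```

With these made-up inputs, three of the four numbers are recovered, so partial accuracy is 0.75 while exact-match is false, which shows how a 92% partial score can coexist with a much lower exact-match rate. In the paired test, a baseline "dominates" in the sense used here when c exceeds b across the discordant items.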
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 38