Large language models can learn and generalize steganographic chain-of-thought under process supervision

Robert McCarthy; Joey SKAF; Luis Ibanez-Lissen; Vasil Georgiev; Connor Watts; Hannes Whittingham; Lorena Gonzalez-Manzano; Cameron Tice; Edward James Young; Puria Radmard; David Lindner

Published: 18 Sept 2025, Last Modified: 29 Oct 2025

Venue: NeurIPS 2025 poster

License: CC BY 4.0

Keywords: AI Safety, AI Control, Steganography, Encoded Reasoning, Chain-of-Thought, Reinforcement Learning, LLMs

TL;DR: We show that penalizing certain CoT reasoning makes LLMs learn encoding schemes that generalize to unseen examples.

Abstract: Chain-of-thought (CoT) reasoning not only enhances large language model performance but also provides critical insight into a model's decision-making process, making it a useful tool for monitoring model intent and planning. By proactively preventing models from acting on CoT that indicates misaligned or harmful intent, CoT monitoring can reduce the risks associated with deploying models. However, developers may be incentivized, by either customer preferences or regulatory requirements, to train away the appearance of harmful intent from CoT traces. Recent work has shown that banning the mention of a specific example of reward hacking causes the undesired reasoning traces to be obfuscated while the undesired behavior persists, threatening the reliability of CoT monitoring. Such obfuscation can arise either because the reasoning is internalized into latent-space computation or because it is encoded within the CoT. We extend these results by examining the ability of models to learn a specific type of obfuscated reasoning: steganography. First, we show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings. Crucially, this does not alter the underlying method by which the model performs the task, demonstrating that the model can learn to steganographically encode its reasoning. We further demonstrate that models can generalize an encoding scheme: when the penalized strings belong to an overarching class, the model learns not only to substitute strings seen in training but also develops a general encoding scheme for all members of the class, which it can apply to held-out test strings.
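
Illustrative sketch (not from the paper): the kind of process supervision described in the abstract can be pictured as a task reward minus a penalty for each occurrence of a banned string in the CoT. The function names `cot_string_penalty` and `shaped_reward`, the whole-word case-insensitive matching, and the penalty weight of 0.5 are all assumptions chosen for illustration; the authors' actual reward design may differ.

```python
import re


def cot_string_penalty(cot: str, banned: list[str], penalty_per_hit: float = 1.0) -> float:
    """Penalty proportional to the number of banned-string occurrences in the CoT.

    Hypothetical helper: matches whole words, case-insensitively.
    """
    hits = 0
    for s in banned:
        pattern = r"\b" + re.escape(s) + r"\b"
        hits += len(re.findall(pattern, cot, flags=re.IGNORECASE))
    return penalty_per_hit * hits


def shaped_reward(task_correct: bool, cot: str, banned: list[str]) -> float:
    """Task reward (1.0 if correct, else 0.0) minus the CoT string penalty.

    The 0.5 weight is an arbitrary illustrative value, not taken from the paper.
    """
    task_reward = 1.0 if task_correct else 0.0
    return task_reward - cot_string_penalty(cot, banned, penalty_per_hit=0.5)


# Placeholder banned strings and reasoning traces, for illustration only.
banned = ["alpha", "beta"]
plain_cot = "Start at Alpha, then beta, so the answer is alpha."
encoded_cot = "Start at A, then B, so the answer is A."

plain_score = shaped_reward(True, plain_cot, banned)
encoded_score = shaped_reward(True, encoded_cot, banned)
```

Under this toy reward, a trace that swaps each banned string for a substitute scores higher than the plain trace while the reasoning steps stay the same, which is the incentive the abstract identifies as leading to encoded reasoning.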

Supplementary Material: zip

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 28295
