Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

ICLR 2026 Conference Submission19966 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Probes, Interpretability, Safety
TL;DR: We show that LLMs can learn a "self-obfuscation" skill to manipulate their internal activations, allowing them to evade previously unseen safety monitors for concepts like harmfulness and deception, even when trained only to hide benign concepts.
Abstract: Activation monitoring, which probes a model's internal states using lightweight classifiers, has become a key tool for AI safety. However, its worst-case robustness under a misalignment threat model—where a model might learn to actively conceal its internal states—remains untested. Focusing on this threat model, we ask: could a model learn to evade previously unseen activation monitors? Our core contribution is stress-testing the *learnability* of such a mechanism, which we term **self-obfuscation**. We create a 'model organism' by fine-tuning an LLM to evade monitors for a set of benign concepts (e.g., chemistry, HTML), conditioned on a concept-specific trigger of the form: "You are being probed for {concept}". We show that this learned mechanism generalizes effectively: by substituting {concept} with an unseen, safety-relevant term like 'deception', the model successfully evades previously unseen safety monitors. Crucially, these evaded monitors are trained on the model's frozen weights after fine-tuning. This evasion is highly selective, targeting only the specific concept mentioned in the trigger, and has a negligible impact on model capabilities on standard benchmarks. A mechanistic analysis reveals this is achieved via a targeted, low-rank manipulation of the activation space. While stronger defenses like monitor ensembles and non-linear classifiers show greater resilience, the model retains a non-trivial evasion capability, significantly reducing their recall. Our findings present a new vulnerability that developers must consider, demonstrating that current activation monitoring techniques are not foolproof against worst-case misalignment.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19966
Loading