Leveraging Sparse Autoencoders for Passive Scoping
Keywords: ML, AI, Artificial Intelligence, Machine Learning, Sparse Autoencoder, SAE, Representation, Anomaly Detection, AI Safety, AIS, Interpretability, Mechanistic Interpretability, AI Security, Jailbreak, Jailbreaks, Trojan, Trojans, Data poisoning, Adversarial Robustness, AI Alignment, Alignment, Post-training, Finetuning, RLHF, Unlearning, Unsupervised, Representation-learning
TL;DR: We leverage a whitebox method to filter out off-task LLM activations, increasing safety against jailbreaks and trojans and thereby introducing a new paradigm in AI security/safety.
Abstract: The general-purpose nature of large language models (LLMs) presents a serious challenge for adversarial robustness. Currently, most approaches to defending LLMs from jailbreak and data poisoning attacks rely on explicitly training against known attacks and behaviors. However, this places a burden on model developers, because they cannot anticipate all such attacks and behaviors. To solve this problem, we implement the principle of least privilege (PoLP): we propose that model developers specify the knowledge and capabilities an AI system should retain and restrict all others by default. We call this type of approach passive scoping. This paper characterizes and evaluates representation-filtering methods in three realistic settings and against two types of attacks: jailbreaks and trojans. Our SAE-enhanced method Pareto-dominates baselines on the tradeoff between in-domain utility and out-of-distribution (OOD) safety. We show that, unlike existing methods, passive scoping requires knowledge of neither unwanted inputs nor unwanted outputs. Our results suggest that by leveraging the PoLP, model developers can increase safety from unknown unknowns. We conclude with an analysis of where and why our methods work best.
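For illustration, here is a minimal sketch of what SAE-based representation filtering could look like. The module names, dimensions, and latent allow-list below are assumptions made for the sake of the example, not the paper's actual implementation.

```python
# Hedged sketch: "passive scoping" of LLM activations via an SAE.
# All names, shapes, and the allow-list are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE that encodes residual-stream activations into sparse latents."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.dec(z)


def scope_activations(h: torch.Tensor, sae: SparseAutoencoder,
                      allowed_latents: torch.Tensor) -> torch.Tensor:
    """Reconstruct hidden states using only latents whitelisted for the task.

    Latents outside `allowed_latents` are zeroed, so off-task directions
    (e.g., those triggered by jailbreak or trojan inputs) are filtered out
    by default -- a principle-of-least-privilege view of the residual stream.
    """
    z = sae.encode(h)
    mask = torch.zeros(z.shape[-1], dtype=z.dtype)
    mask[allowed_latents] = 1.0
    return sae.decode(z * mask)


if __name__ == "__main__":
    d_model, d_latent = 768, 16384            # hypothetical sizes
    sae = SparseAutoencoder(d_model, d_latent)
    h = torch.randn(4, 32, d_model)           # (batch, seq, d_model) activations
    allowed = torch.arange(0, 512)            # latents deemed in-domain (assumed)
    h_scoped = scope_activations(h, sae, allowed)
    print(h_scoped.shape)                     # torch.Size([4, 32, 768])
```

In practice the SAE weights and the in-domain latent set would come from the developer's specification of retained capabilities; the sketch only shows the filtering mechanism itself.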
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19915