Leveraging Sparse Autoencoders for Passive Scoping
Keywords: ML, AI, Artificial Intelligence, Machine Learning, Sparse Autoencoder, SAE, Representation, Anomaly Detection, AI Safety, AIS, Interpretability, Mechanistic Interpretability, AI Security, Jailbreak, Jailbreaks, Trojan, Trojans, Data poisoning, Adversarial Robustness, AI Alignment, Alignment, Post-training, Finetuning, RLHF, Unlearning, Unsupervised, Representation-learning
TL;DR: We leverage a whitebox method to filter out off-task LLM activations, increasing safety against jailbreaks and trojans and thereby introducing a new paradigm in AI security/safety.
Abstract: The general-purpose nature of large language models (LLMs) presents a serious challenge for adversarial robustness. Currently, most approaches to defending LLMs from jailbreak and data poisoning attacks rely on explicitly training against known attacks and behaviors. However, this places a burden on model developers, because they cannot anticipate all such attacks and behaviors. To solve this problem, we implement the principle of least privilege (PoLP): we propose that model developers specify the knowledge and capabilities an AI system should retain and restrict all others by default. We call this type of approach passive scoping. This paper characterizes and evaluates representation-filtering methods in three realistic settings and against two types of attacks: jailbreaks and trojans. Our SAE-enhanced method Pareto-dominates baselines on the tradeoff between in-domain utility and out-of-distribution (OOD) safety. We show that, unlike existing methods, passive scoping requires knowledge of neither unwanted inputs nor unwanted outputs. Our results suggest that by leveraging the PoLP, model developers can increase safety from unknown unknowns. We conclude with an analysis of where and why our methods work best.
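For illustration, here is a minimal sketch of what SAE-based representation filtering could look like. The module names, dimensions, and latent allow-list below are assumptions made for the sake of the example, not the paper's actual implementation.

```python
# Hedged sketch: "passive scoping" of LLM activations via an SAE.
# All names, shapes, and the allow-list are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE that encodes residual-stream activations into sparse latents."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.dec(z)


def scope_activations(h: torch.Tensor, sae: SparseAutoencoder,
                      allowed_latents: torch.Tensor) -> torch.Tensor:
    """Reconstruct hidden states using only latents whitelisted for the task.

    Latents outside `allowed_latents` are zeroed, so off-task directions
    (e.g., those triggered by jailbreak or trojan inputs) are filtered out
    by default -- a principle-of-least-privilege view of the residual stream.
    """
    z = sae.encode(h)
    mask = torch.zeros(z.shape[-1], dtype=z.dtype)
    mask[allowed_latents] = 1.0
    return sae.decode(z * mask)


if __name__ == "__main__":
    d_model, d_latent = 768, 16384            # hypothetical sizes
    sae = SparseAutoencoder(d_model, d_latent)
    h = torch.randn(4, 32, d_model)           # (batch, seq, d_model) activations
    allowed = torch.arange(0, 512)            # latents deemed in-domain (assumed)
    h_scoped = scope_activations(h, sae, allowed)
    print(h_scoped.shape)                     # torch.Size([4, 32, 768])
```

In practice the SAE weights and the in-domain latent set would come from the developer's specification of retained capabilities; the sketch only shows the filtering mechanism itself.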
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19915