Sparse Autoencoders are Capable LLM Jailbreak Mitigators

Yannick Assogba; Jacopo Cortellazzi; Javier Abad; Pau Rodriguez; Xavier Suau; Arno Blaas

Published: 11 Jun 2026, Last Modified: 19 Jun 2026
Venue: Mech Interp Workshop ICML 2026 (Poster)
License: CC BY 4.0

Keywords: Methods (probing, steering, causal interventions), Interpretability for AI Safety, Applications of interpretability

TL;DR: We propose a method that uses SAE latents for LLM jailbreak mitigation. Our method performs as well as, or better than, steering in the original dense activation space.

Abstract: Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and thirteen jailbreak attacks, CC-Delta achieves comparable or better safety–utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, most notably against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest that off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
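The two steps named in the abstract (feature selection by statistical testing, then mean-shift steering in SAE latent space) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which test is used, so a per-feature paired t-statistic with a fixed threshold is assumed here, and the shift direction (from jailbreak toward plain harmful latents), the non-negativity clamp, and every function name and number below are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_stats(z_plain, z_jailbreak):
    """Per-feature paired t-statistic of (jailbreak - plain) latents.
    Assumption: a paired t-test stands in for the unspecified test."""
    n = len(z_plain)
    stats = []
    for j in range(len(z_plain[0])):
        diffs = [z_jailbreak[i][j] - z_plain[i][j] for i in range(n)]
        sd = stdev(diffs)
        if sd == 0.0:
            # Constant difference: no evidence if it is zero, else unbounded.
            m = mean(diffs)
            stats.append(0.0 if m == 0 else math.copysign(math.inf, m))
        else:
            stats.append(mean(diffs) / (sd / math.sqrt(n)))
    return stats

def select_features(t_stats, threshold):
    """Indices of features whose |t| exceeds the (hypothetical) threshold."""
    return [j for j, t in enumerate(t_stats) if abs(t) > threshold]

def mean_shift(z_plain, z_jailbreak, selected):
    """Mean (plain - jailbreak) difference on the selected features only."""
    n = len(z_plain)
    return {j: mean(z_plain[i][j] - z_jailbreak[i][j] for i in range(n))
            for j in selected}

def steer(z, shift, alpha=1.0):
    """Add the scaled shift to the selected latents at inference time.
    Assumption: latents are clamped at zero, as for a ReLU SAE."""
    out = list(z)
    for j, d in shift.items():
        out[j] = max(0.0, out[j] + alpha * d)
    return out

# Toy paired latents: 4 prompt pairs, 3 SAE features (made-up numbers).
z_plain = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.2], [0.0, 1.1, 0.2], [0.1, 1.0, 0.2]]
z_jb = [[2.0, 1.0, 0.2], [2.1, 1.0, 0.2], [1.9, 1.0, 0.2], [2.2, 1.0, 0.2]]

t_stats = paired_t_stats(z_plain, z_jb)
selected = select_features(t_stats, threshold=3.0)
shift = mean_shift(z_plain, z_jb, selected)
steered = steer([2.0, 1.0, 0.2], shift)
```

In this toy data only feature 0 changes consistently when jailbreak context is added, so it is the only feature selected and the only latent the steering step modifies; the other latents pass through unchanged, which is the intuition behind steering a sparse subset of features rather than the full dense activation.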

Submission Number: 359
