Interpretable Guardrails: Jailbreak Detection and Debugging with Sparse Autoencoder Features

ACL ARR 2026 January Submission 2795 Authors

03 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: Large language models, jailbreak attack, sparse autoencoder
Abstract: This paper presents an interpretable jailbreak detection system that enables mechanistic debugging of safety classifiers without retraining. It leverages Sparse Autoencoder (SAE) features from intermediate transformer layers to decompose model decisions into semantically meaningful components. By training linear classifiers on sparse feature activations and applying statistical tests to discover causal failure mechanisms, the approach achieves performance comparable to state-of-the-art (SOTA) methods, outperforming the specialized LlamaGuard-3-8B guardrail with 16 times fewer parameters while approaching dense hidden-state performance. The paper further introduces a three-stage debugging framework that discovers functional feature roles, validates causal influence through activation patching, and applies targeted interventions to recover false negatives. This framework reduces false negatives by 33\%, improving recall by three percentage points while limiting the precision loss to one point. Because it operates at inference time with negligible computational overhead, the approach enables rapid adaptation to emerging failure modes without costly retraining cycles, demonstrating the practical value of interpretability for safety-critical systems.
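
To make the classification stage concrete, below is a minimal, self-contained Python sketch of the general technique named in the abstract: fitting a linear probe on sparse SAE feature activations and reading its weights as per-feature evidence. It is not the authors' released code; the dictionary size, sparsity level, synthetic activations, and labels are all illustrative assumptions.

# Minimal sketch: linear probe over sparse SAE feature activations.
# Everything here (dictionary size, sparsity level, synthetic labels) is
# an illustrative assumption, not the authors' data or released code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n_prompts, n_features = 2000, 4096          # assumed SAE dictionary size

# Sparse, non-negative activations: most features are zero on any prompt.
acts = rng.exponential(1.0, size=(n_prompts, n_features))
acts *= rng.random((n_prompts, n_features)) < 0.02

# Balanced synthetic labels (1 = jailbreak); a small set of "informative"
# features fires far more often on jailbreak prompts.
labels = rng.integers(0, 2, n_prompts)
informative = rng.choice(n_features, size=20, replace=False)
fire = rng.random((n_prompts, 20)) < (0.02 + 0.3 * labels[:, None])
acts[:, informative] = fire * rng.exponential(1.0, size=(n_prompts, 20))

split = int(0.8 * n_prompts)
probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:split], labels[:split])

preds = probe.predict(acts[split:])
print("precision:", precision_score(labels[split:], preds))
print("recall:   ", recall_score(labels[split:], preds))

# The probe is linear, so each weight is directly readable as per-feature
# evidence for the jailbreak decision -- the starting point for the
# role-discovery, activation-patching, and intervention stages the
# abstract describes.
top = np.argsort(probe.coef_[0])[-5:][::-1]
print("highest-weight SAE features:", top)

Because the probe is linear over individually nameable SAE features, its weights can be inspected directly, which is what makes the downstream debugging stages possible in principle; the paper's actual feature-role discovery and patching procedures are more involved than this sketch.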
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Large language models, jailbreak attack, sparse autoencoder
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2795