Abstract: Large Language Models (LLMs) excel at producing fluent text yet remain prone to generating harmful or biased outputs, largely due to their opaque, “black-box” nature. Existing mitigation strategies, such as reinforcement learning from human feedback (RLHF) and instruction tuning, can reduce these risks but often demand extensive retraining and may not generalize. An alternative approach leverages sparse autoencoders (SAEs) to extract disentangled, interpretable representations from LLM activations, enabling the detection of specific semantic attributes without modifying the base model. In this work, we extend the Sparse Conditioned Autoencoder (SCAR) framework [Härle et al., 2024] to enable multi-attribute detection and steering. Our approach disentangles multiple semantic features—such as toxicity and style—in a unified latent space, providing granular, real-time control without compromising textual quality. Experimental results demonstrate that our multi-feature extension maintains the interpretability, safety, and quality of the original single-attribute SCAR while offering enhanced flexibility by allowing simultaneous control over multiple semantic attributes. Furthermore, evaluations under both black-box and white-box adversarial attack scenarios reveal that our approach remains robust, reinforcing its potential as a reliable and adaptable safety mechanism for LLMs.
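The abstract describes conditioning units of a sparse autoencoder's latent space on semantic attributes so they can be read out for detection or clamped for steering, without touching the base model. The sketch below illustrates one plausible reading of that setup, extended from one attribute to several: the first `n_attrs` latent units each receive a supervised condition loss alongside the usual reconstruction and sparsity terms. All class, parameter, and coefficient names here are illustrative assumptions for exposition, not the authors' SCAR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttrSCAR(nn.Module):
    """Sketch: SAE over LLM activations in which the first n_attrs latent
    units are each conditioned on one semantic attribute (e.g. toxicity, style)."""

    def __init__(self, d_model: int, d_latent: int, n_attrs: int):
        super().__init__()
        self.n_attrs = n_attrs
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor):
        pre = self.encoder(h)
        return pre, F.relu(pre)          # pre-activations (used as logits) and sparse code

    def forward(self, h: torch.Tensor):
        pre, z = self.encode(h)
        return pre, z, self.decoder(z)   # logits, latent code, reconstructed activation

    def loss(self, h, attr_labels, l1_coef: float = 1e-3, cond_coef: float = 1.0):
        # attr_labels: (batch, n_attrs) binary labels, one column per attribute.
        pre, z, h_hat = self(h)
        recon = F.mse_loss(h_hat, h)                 # keep the base model's activation intact
        sparsity = z.abs().mean()                    # L1 penalty encourages disentangled features
        cond = F.binary_cross_entropy_with_logits(   # supervise one designated unit per attribute
            pre[:, : self.n_attrs], attr_labels.float()
        )
        return recon + l1_coef * sparsity + cond_coef * cond

    @torch.no_grad()
    def steer(self, h, attr_idx: int, value: float):
        """Clamp one conditioned unit and decode, leaving the remaining features untouched."""
        _, z, _ = self(h)
        z[:, attr_idx] = value
        return self.decoder(z)
```

Under these assumptions, detection reads the conditioned units of `z` directly, while `steer(h, attr_idx=0, value=0.0)` would suppress whichever attribute unit 0 was trained on before the reconstructed activation is passed back into the model; the loss weighting and the choice of layer to hook are likewise assumptions.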
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: mechanistic interpretability, sparse autoencoders, jailbreaking, AI safety
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6737