Keywords: AI alignment, Activation steering, Sparse representations, Sparse autoencoders (SAEs), Large language models (LLMs), Interpretability
TL;DR: Sparse activation steering (SAS) is a steering method that operates in sparse spaces to precisely control LLM behavior by isolating interpretable features.
Abstract: A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a promising solution. However, prior work in dense activation spaces struggles with $\textit{superposition}$, where multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations offer an untapped opportunity for more interpretable behavior modulation. In this work, we introduce $\textit{Sparse Activation Steering}$ (SAS), a novel method for steering LLM behavior in $\textit{sparse spaces}$. By isolating behavior-specific features (i.e., latent dimensions) through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable steering on par with their dense counterparts while offering interpretability advantages such as easier compositionality of features in these spaces. Furthermore, our scaling studies on sparse latents reveal a trend toward greater sparsity in SAS vectors, approaching ideal $\textit{monosemanticity}$.
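A minimal sketch of the contrastive prompt-pairing idea described in the abstract, assuming a ReLU SAE encoder over residual-stream activations collected at one layer; the function names, tensor shapes, and top-k feature selection below are illustrative assumptions, not the paper's exact procedure:

```python
# Hedged sketch: build a sparse steering vector by contrasting SAE latents
# of behavior-present ("positive") vs. behavior-absent ("negative") prompts.
import torch


def sae_encode(resid_acts: torch.Tensor, W_enc: torch.Tensor, b_enc: torch.Tensor) -> torch.Tensor:
    """Encode residual-stream activations into sparse SAE latents (ReLU encoder, assumed form)."""
    return torch.relu(resid_acts @ W_enc + b_enc)


def sparse_steering_vector(pos_acts, neg_acts, W_enc, b_enc, top_k=32):
    """Contrast positive/negative prompt activations in SAE latent space.

    pos_acts, neg_acts: [num_prompts, d_model] activations collected at one layer.
    Returns a sparse vector over SAE latents, keeping only the top-k features
    that differ most between the two prompt sets (selection rule is illustrative).
    """
    diff = sae_encode(pos_acts, W_enc, b_enc).mean(0) - sae_encode(neg_acts, W_enc, b_enc).mean(0)
    idx = diff.abs().topk(top_k).indices          # strongest behavior-specific latents
    sas_vec = torch.zeros_like(diff)
    sas_vec[idx] = diff[idx]                      # zero out everything else
    return sas_vec


# Toy usage with random tensors standing in for collected activations.
d_model, d_sae = 2304, 16384
W_enc, b_enc = torch.randn(d_model, d_sae) * 0.02, torch.zeros(d_sae)
pos = torch.randn(8, d_model)
neg = torch.randn(8, d_model)
vec = sparse_steering_vector(pos, neg, W_enc, b_enc)
print(int((vec != 0).sum()), "active latents")
```

In this sketch, reinforcing a behavior would correspond to adding (a scaled, decoded version of) the sparse vector back into the residual stream at inference time, and suppressing it to subtracting it; the scaling and injection point are design choices not specified here.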
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1749