Keywords: identifiable representation learning, identifiability, sparse autoencoders, interpretability, safety, disentanglement, sparsity
TL;DR: We provably learn steering vectors from paired data that shift in multiple unknown concepts, using a sparse autoencoding framework.
Abstract: Unsupervised approaches to large language model (LLM) interpretability, such as sparse autoencoders (SAEs), offer a way to decode LLM activations into interpretable and, ideally, controllable concepts. On one hand, these approaches alleviate the need for supervision from concept labels, paired prompts, or explicit knowledge of a high-level causal model. On the other hand, without additional assumptions, SAEs are not guaranteed to be identifiable. In practice, they may learn latent dimensions that entangle multiple underlying concepts. If we use these dimensions to extract vectors for steering specific LLM behaviours, this non-identifiability risks interventions that inadvertently affect unrelated properties. In this paper, we bring the question of identifiability to the forefront of LLM interpretability research. Specifically, we introduce Sparse Shift Autoencoders (*SSAE*s) that instead map the *differences* between embeddings to sparse representations. Crucially, we show that *SSAE*s are identifiable from paired observations that vary in *multiple unknown concepts*. With this key identifiability result, we show that we can steer single concepts with only weak supervision. Finally, we empirically demonstrate identifiable concept recovery across multiple real-world language datasets by disentangling activations from different LLMs.
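To make the core idea concrete, here is a minimal sketch of a sparse shift autoencoder as described in the abstract: it encodes the *difference* between paired activation vectors into a sparse code and decodes that code back to the shift. This is an illustrative assumption of one possible PyTorch-style implementation, not the authors' code; the linear encoder/decoder, the L1 sparsity penalty, and all names and hyperparameters (`SparseShiftAutoencoder`, `ssae_loss`, `n_concepts`, `l1_weight`) are hypothetical.

```python
import torch
import torch.nn as nn


class SparseShiftAutoencoder(nn.Module):
    """Illustrative sketch (not the paper's implementation): encode the
    difference between two LLM activation vectors into a sparse concept
    code, then decode it back to the difference."""

    def __init__(self, embed_dim: int, n_concepts: int):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, n_concepts)
        self.decoder = nn.Linear(n_concepts, embed_dim, bias=False)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        delta = x_b - x_a            # shift between paired embeddings
        z = self.encoder(delta)      # concept code for the shift
        delta_hat = self.decoder(z)  # reconstructed shift
        return z, delta_hat


def ssae_loss(delta, delta_hat, z, l1_weight=1e-3):
    # Reconstruct the shift while penalising the L1 norm of the code,
    # so that each paired shift is explained by few active concepts.
    recon = ((delta - delta_hat) ** 2).mean()
    sparsity = z.abs().mean()
    return recon + l1_weight * sparsity


if __name__ == "__main__":
    torch.manual_seed(0)
    model = SparseShiftAutoencoder(embed_dim=768, n_concepts=32)
    x_a, x_b = torch.randn(4, 768), torch.randn(4, 768)  # toy paired activations
    z, delta_hat = model(x_a, x_b)
    loss = ssae_loss(x_b - x_a, delta_hat, z)
    loss.backward()
    print(float(loss))
```

Under this sketch, a learned decoder column would play the role of a steering vector for the corresponding concept dimension; the paper's identifiability result is what would justify treating these dimensions as disentangled concepts.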
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21155