Unveiling Control Vectors in Language Models with Sparse Autoencoders

ICLR 2025 Workshop BuildingTrust Submission 95 Authors

11 Feb 2025 (modified: 06 Mar 2025) · Submitted to BuildingTrust · CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: sparse autoencoder, controllability, model explanation, mechanistic interpretability
Abstract:

Sparse autoencoders (SAEs) have recently emerged as a promising tool for explaining the internal mechanisms of large language models (LLMs) by disentangling complex activations into interpretable features. However, understanding the role and behavior of individual SAE features remains challenging. Prior approaches primarily focus on interpreting SAE features based on their activations or input correlations, which provides limited insight into their influence on model outputs. In this work, we investigate a specific subset of SAE features that directly control the generation behavior of LLMs. We term these “generation features”, as they reliably trigger the generation of specific tokens or semantically related token groups when activated, regardless of input context. Using a systematic methodology based on causal intervention, we identify and validate these features with significantly higher precision than baseline methods. Through extensive experiments on the Gemma models, we demonstrate that generation features reveal interesting phenomena about both the LLM and SAE architectures. These findings deepen our understanding of the generative mechanisms within LLMs and highlight the potential of SAEs for controlled text generation and model interpretability. Our code is available at https://anonymous.4open.science/r/control-vector-with-sae-AAFB.
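To make the notion of a feature-level causal intervention concrete, the sketch below clamps a single SAE feature to a fixed value and reconstructs the edited activation, the basic operation behind steering generation with an SAE feature. This is a minimal toy illustration, not the authors' pipeline: the dimensions, weights, and helper names (`encode`, `decode`, `intervene`) are all assumptions made for the example.

```python
# Minimal sketch (toy dimensions, hypothetical names): clamp one SAE feature
# on a residual-stream activation and reconstruct the edited activation.
import torch

d_model, d_sae = 16, 64  # toy sizes; real SAEs are far larger

# A tiny ReLU SAE: linear encoder with bias, linear decoder with bias.
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def encode(x):
    # Map an activation to sparse feature coefficients.
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def decode(f):
    # Reconstruct the activation from feature coefficients.
    return f @ W_dec + b_dec

def intervene(x, feature_idx, value):
    # Causal intervention: pin one feature to a chosen value, then decode.
    f = encode(x)
    f[..., feature_idx] = value
    return decode(f)

x = torch.randn(1, d_model)                  # stand-in for a model activation
x_edit = intervene(x, feature_idx=3, value=10.0)
print((x_edit - decode(encode(x))).norm())   # magnitude of the edit
```

In an actual LLM, `x_edit` would be written back into the layer's residual stream via a forward hook during generation; a feature would count as a "generation feature" if clamping it reliably elicits its associated tokens regardless of the input prompt.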

Submission Number: 95