Evaluating Sparse Autoencoders for Controlling Open-Ended Text Generation

NeurIPS 2024 Workshop ATTRIB Submission 55 Authors

Published: 30 Oct 2024, Last Modified: 14 Jan 2025
Venue: ATTRIB 2024
License: CC BY 4.0
Keywords: language models, mechanistic interpretability, steering vectors, sparse autoencoders
Abstract: Sparse autoencoders (SAEs) have emerged as a promising method for disentangling large language model (LLM) activations into human-interpretable features. However, evaluating SAEs remains challenging, as current metrics like reconstruction loss and sparsity provide limited insight into whether the extracted features are meaningful and useful. We propose a novel evaluation methodology that measures SAEs' effectiveness at controlling open-ended text generation. Using the TinyStories dataset and models as a minimal yet realistic testbed, we develop an automated pipeline that extracts steering vectors for concepts in an unsupervised way. We then evaluate how well these vectors control text generation compared to SAE latents. Our results show that individual SAE latents often improve on the Pareto front between steering success and generation coherence traced by supervised steering vectors. This suggests that SAEs can learn meaningful, disentangled features useful for model control, providing evidence for their effectiveness beyond standard reconstruction metrics.
Submission Number: 55
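
To make the intervention concrete, the sketch below shows the generic form of activation steering that this kind of evaluation compares: adding a single direction to the residual stream of a small TinyStories-style causal LM during generation. It is a minimal illustration, not the paper's pipeline; the checkpoint name, layer index, steering strength, and the random placeholder vector are all illustrative assumptions. In practice the direction would be either a supervised steering vector (e.g., a difference of mean activations) or one SAE latent's decoder direction.

```python
# Minimal activation-steering sketch (assumptions: a GPT-Neo-style HuggingFace
# checkpoint with blocks at model.transformer.h; LAYER and ALPHA are
# illustrative placeholders, not the paper's settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "roneneldan/TinyStories-33M"  # assumed checkpoint; any similar causal LM works
LAYER, ALPHA = 4, 8.0                 # hypothetical layer index and steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

d_model = model.config.hidden_size
steer_vec = torch.randn(d_model)          # placeholder; in practice an SAE decoder row
steer_vec = steer_vec / steer_vec.norm()  # steer along a unit direction

def steering_hook(module, inputs, output):
    # Transformer blocks return a tuple; hidden states are the first element.
    hidden = output[0] + ALPHA * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("Once upon a time", return_tensors="pt")
    out = model.generate(
        **ids, max_new_tokens=50, do_sample=True, top_p=0.9,
        pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

An evaluation in the spirit of the abstract would sweep ALPHA for each candidate direction, score the continuations for steering success (does the target concept appear?) and coherence, and compare the resulting Pareto fronts.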