Release Opt Out: No, I don't wish to opt out of paper release. My paper should be released.
Keywords: language models, mechanistic interpretability, steering vectors, sparse autoencoders
Abstract: Sparse autoencoders (SAEs) have emerged as a promising method for
disentangling large language model (LLM) activations into
human-interpretable features. However, evaluating SAEs remains challenging,
as current metrics like reconstruction loss and sparsity provide limited
insight into whether the extracted features are meaningful and useful.
We propose a novel evaluation methodology that measures SAEs' effectiveness
at controlling open-ended text generation. Using the TinyStories dataset
and models as a minimal yet realistic testbed, we develop an automated
pipeline to extract steering vectors for concepts in an unsupervised way. We
then evaluate how well these vectors can control text generation compared to
SAE latents.
Our results show that individual SAE latents often improve on the Pareto
front of steering success versus generation coherence achieved by supervised
steering vectors. This suggests that SAEs can learn meaningful,
disentangled features useful for model control, providing evidence for their
effectiveness beyond standard reconstruction metrics.
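The intervention evaluated here is activation steering: adding a direction (a supervised steering vector or an SAE decoder latent) to a hidden state during generation. The following is a minimal sketch of that idea, not the paper's actual pipeline; the model name, layer index, steering strength, and the random placeholder direction are all illustrative assumptions.

```python
# Minimal sketch of activation steering on a TinyStories-scale model.
# Assumptions (not from the paper): model name, layer index, strength, and the
# random placeholder direction standing in for a steering vector or SAE latent.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "roneneldan/TinyStories-33M"  # assumed small GPT-Neo-style model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# In practice this direction would be a learned steering vector (e.g. a
# difference of mean activations) or a column of the SAE decoder matrix.
hidden_size = model.config.hidden_size
direction = torch.randn(hidden_size)
direction = direction / direction.norm()
strength = 5.0   # assumed scale; governs the steering/coherence trade-off
layer_idx = 2    # assumed residual-stream layer to intervene on

def add_steering(module, inputs, output):
    # GPT-Neo blocks return a tuple whose first element is the hidden state;
    # add the scaled direction to every token position.
    hidden = output[0] + strength * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
try:
    ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

Steering success and coherence of the resulting generations could then be scored (e.g. by an automated judge) and traded off against the chosen strength.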
Submission Number: 55