Release Opt Out: No, I don't wish to opt out of paper release. My paper should be released.
Keywords: language models, mechanistic interpretability, steering vectors, sparse autoencoders
Abstract: Sparse autoencoders (SAEs) have emerged as a promising method for
    disentangling large language model (LLM) activations into
    human-interpretable features. However, evaluating SAEs remains challenging,
    as current metrics like reconstruction loss and sparsity provide limited
    insight into whether the extracted features are meaningful and useful.
    We propose a novel evaluation methodology that measures SAEs' effectiveness
    at controlling open-ended text generation. Using the TinyStories dataset
    and models as a minimal yet realistic testbed, we develop an automated
    pipeline to extract steering vectors for concepts in an unsupervised way. We
    then compare how well these vectors and individual SAE latents control
    text generation.
    Our results show that individual SAE latents can often improve upon the
    Pareto front between steering success and generation coherence compared to
    supervised steering vectors. This suggests that SAEs can learn meaningful,
    disentangled features useful for model control, providing evidence for their
    effectiveness beyond standard reconstruction metrics.
Submission Number: 55