Keywords: interpretability, language model, SAE, features, explanation
TL;DR: We build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs
Abstract: While the activations of neurons in deep neural networks usually do not have a simple human-understandable interpretation, sparse autoencoders (SAEs) can be used to transform these activations into a higher-dimensional latent space that may be more easily interpretable. However, these SAEs can have millions of distinct latent features, making it infeasible for humans to interpret each one manually. In this work, we build an open-source automated pipeline to generate and evaluate natural language explanations for SAE features using LLMs. We test our pipeline on SAEs of varying sizes, activation functions, and losses, trained on three different open-weight LLMs. We present new techniques for scoring the quality of explanations that are cheaper to run than the previous state of the art. We propose guidelines for generating better explanations that remain valid for a broader set of activating contexts, and discuss common pitfalls of current scoring techniques. We also introduce a novel similarity metric for SAEs, and find that SAEs trained on nearby layers of the residual stream are much more similar than ones trained on adjacent MLPs. We anticipate that the proposed open-source framework will improve future evaluations of the interpretability of SAEs and enable more work on this front.
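To make the generate-then-score loop in the abstract concrete, here is a minimal sketch in Python. All names are hypothetical placeholders rather than the released API: `llm` stands for any prompt-to-completion function, `examples` for a feature's top-activating contexts, and `detection_score` for one plausible cheap scoring approach (classifying whether held-out contexts match the explanation); the paper's actual pipeline and scorers may differ.

```python
from typing import Callable

def explain_feature(examples: list[tuple[str, float]],
                    llm: Callable[[str], str]) -> str:
    """Ask an LLM to summarize an SAE feature from its top-activating contexts.

    `examples` are (text, activation) pairs collected for one feature;
    `llm` is any prompt -> completion function (hypothetical interface).
    """
    lines = [f"[activation {act:.2f}] {text}" for text, act in examples]
    prompt = ("Each snippet below strongly activates the same SAE feature.\n"
              + "\n".join(lines)
              + "\nIn one sentence, what concept does this feature capture?")
    return llm(prompt)

def detection_score(explanation: str,
                    positives: list[str],
                    negatives: list[str],
                    llm: Callable[[str], str]) -> float:
    """Cheap detection-style scorer (an assumed variant, not necessarily the
    paper's): ask the LLM whether each held-out snippet matches the
    explanation, and report accuracy against whether the snippet actually
    activates the feature."""
    labeled = [(t, True) for t in positives] + [(t, False) for t in negatives]
    correct = 0
    for text, label in labeled:
        answer = llm(f"Explanation: {explanation}\nText: {text}\n"
                     "Does the text match the explanation? Answer yes or no.")
        correct += answer.strip().lower().startswith("yes") == label
    return correct / len(labeled)
```

Scoring by yes/no classification over sampled contexts is much cheaper than full activation simulation, since it requires one short completion per context rather than per-token predictions.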
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7275