Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

ICLR 2025 Conference Submission 12831 Authors

28 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: mechanistic interpretability, sparse autoencoders, evaluations
TL;DR: We compute and validate *supervised* sparse feature dictionaries on the IOI task, and then compare SAEs against them
Abstract: Disentangling model activations into human-interpretable features is a central problem in interpretability. Sparse autoencoders (SAEs) have recently attracted much attention as a scalable unsupervised approach to this problem. However, our imprecise understanding of ground-truth features in realistic scenarios makes it difficult to measure the success of SAEs. To address this challenge, we propose to evaluate SAEs on specific tasks by comparing them to supervised feature dictionaries computed with knowledge of the concepts relevant to the task. Specifically, we suggest that it is possible to (1) compute supervised sparse feature dictionaries that disentangle model computations for a specific task; (2) use them to evaluate and contextualize the degree of disentanglement and control offered by SAE latents on this task. Importantly, we can do this in a way that is agnostic to whether the SAEs have learned the exact ground-truth features or a different but similarly useful representation. As a case study, we apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with SAEs trained on either the IOI or OpenWebText datasets. We find that SAEs capture interpretable features for the IOI task, and that more recent SAE variants such as Gated SAEs and Top-K SAEs are competitive with supervised features in terms of disentanglement and control over the model. We also exhibit, through this setup and toy models, some qualitative phenomena in SAE training illustrating feature splitting and the role of feature magnitudes in solutions preferred by SAEs.
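To make the evaluation setup concrete, here is a minimal sketch (an illustration under stated assumptions, not the authors' implementation): a supervised feature dictionary is formed from attribute-conditional mean activations, and its reconstruction error is compared against a Top-K-style SAE on the same activations. The activations and attribute labels are synthetic placeholders, and the names `build_supervised_dictionary`, `supervised_reconstruction`, and `TopKSAE` are hypothetical.

```python
# Sketch only: supervised dictionary from attribute-conditional means vs. a
# simplified, untrained Top-K SAE. Activations and labels below are synthetic.
import torch

torch.manual_seed(0)
d_model, n_prompts = 64, 512

# Placeholder activations and per-prompt attribute labels (standing in for
# IOI attributes such as indirect-object name, subject name, and position).
acts = torch.randn(n_prompts, d_model)
attributes = {
    "io_name": torch.randint(0, 10, (n_prompts,)),
    "s_name": torch.randint(0, 10, (n_prompts,)),
    "position": torch.randint(0, 2, (n_prompts,)),
}

def build_supervised_dictionary(acts, attributes):
    """One feature per (attribute, value): mean deviation of activations
    with that attribute value from the global mean activation."""
    mean_act = acts.mean(dim=0)
    features = {}
    for attr, labels in attributes.items():
        for value in labels.unique():
            mask = labels == value
            features[(attr, int(value))] = acts[mask].mean(dim=0) - mean_act
    return mean_act, features

def supervised_reconstruction(acts, attributes, mean_act, features):
    """Reconstruct each activation as global mean + sum of its attribute-value features."""
    recon = mean_act.expand_as(acts).clone()
    for attr, labels in attributes.items():
        for i, value in enumerate(labels.tolist()):
            recon[i] += features[(attr, value)]
    return recon

class TopKSAE(torch.nn.Module):
    """Simplified Top-K SAE: linear encoder, keep the k largest activations, linear decoder."""
    def __init__(self, d_model, d_hidden, k):
        super().__init__()
        self.k = k
        self.W_enc = torch.nn.Linear(d_model, d_hidden)
        self.W_dec = torch.nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.W_enc(x))
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.W_dec(z_sparse)

mean_act, features = build_supervised_dictionary(acts, attributes)
sup_recon = supervised_reconstruction(acts, attributes, mean_act, features)
sae_recon = TopKSAE(d_model, d_hidden=4 * d_model, k=8)(acts)

mse = lambda a, b: ((a - b) ** 2).mean().item()
print(f"supervised dictionary MSE: {mse(acts, sup_recon):.4f}")
print(f"untrained Top-K SAE MSE:   {mse(acts, sae_recon):.4f}")
```

In the paper's setting, the activations come from GPT-2 Small on IOI prompts and the SAEs are trained (on IOI or OpenWebText); here everything is synthetic and untrained, so only the structure of the comparison carries over.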
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12831