SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders

27 Sept 2024 (modified: 04 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Mechanistic interpretability, Large language models, Sparse autoencoders, Sparse dictionary learning, Unsupervised learning, Interpretable AI
TL;DR: This paper introduces SAGE, a framework to scale realistic ground truth evaluations of sparse autoencoders to large, state-of-the-art language models such as Gemma-2-2B.
Abstract: A key challenge in interpretability is to decompose model activations into meaningful features. Sparse autoencoders (SAEs) have emerged as a promising tool for this task. However, a central problem in evaluating the quality of SAEs is the absence of ground truth features to serve as an evaluation gold standard. Current evaluation methods for SAEs therefore face a significant trade-off: evaluations can either rely on toy models or other proxies with predefined ground truth features, or they can use extensive prior knowledge of realistic task circuits. The former limits the generalizability of the evaluation results, while the latter limits the range of models and tasks that can be used for evaluation. We introduce SAGE: Scalable Autoencoder Ground-truth Evaluation, an evaluation framework for SAEs that yields high-quality feature dictionaries for diverse tasks and feature distributions without relying on prior knowledge. Specifically, we lift previous limitations by showing that ground truth evaluations on realistic tasks can be automated and scaled. First, we show that we can automatically identify the cross-sections in the model where task-specific features are active. Second, we demonstrate that we can then compute the ground truth features at these cross-sections. Third, we introduce a novel reconstruction method that significantly reduces the number of trained SAEs needed for the evaluation. This addresses scalability limitations in prior work and significantly simplifies practical evaluations. We validate our results by evaluating SAEs on novel tasks on Pythia-70M, GPT-2 Small, and Gemma-2-2B, demonstrating the scalability of our method to state-of-the-art open-source frontier models. These advancements pave the way for generalizable, large-scale evaluations of SAEs in interpretability research.
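The abstract does not specify how ground truth evaluation scores are computed, but as a rough illustration of the general idea, the sketch below scores a learned SAE decoder dictionary against a set of known ground-truth feature directions using mean max cosine similarity, a metric commonly used in toy-model SAE evaluations. The function name, array shapes, and metric choice are illustrative assumptions for this sketch, not SAGE's actual procedure.

```python
import numpy as np


def mean_max_cosine_similarity(learned_dict: np.ndarray,
                               ground_truth: np.ndarray) -> float:
    """Score a learned SAE dictionary against known ground-truth features.

    learned_dict: (n_learned, d_model) decoder directions of the SAE.
    ground_truth: (n_true, d_model) ground-truth feature directions.

    For every ground-truth feature, find the best-matching learned feature
    by cosine similarity, then average over the ground-truth features.
    (Illustrative metric only; not the paper's SAGE procedure.)
    """
    # Normalize rows to unit norm so dot products are cosine similarities.
    learned = learned_dict / np.linalg.norm(learned_dict, axis=1, keepdims=True)
    truth = ground_truth / np.linalg.norm(ground_truth, axis=1, keepdims=True)

    # (n_true, n_learned) matrix of pairwise cosine similarities.
    sims = truth @ learned.T

    # Best learned match per ground-truth feature, averaged over features.
    return float(sims.max(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_true, n_learned = 64, 16, 128
    true_feats = rng.normal(size=(n_true, d_model))
    # A dictionary that happens to contain every ground-truth direction
    # plus random extra features should score close to 1.0.
    learned = np.concatenate(
        [true_feats, rng.normal(size=(n_learned - n_true, d_model))]
    )
    print(mean_max_cosine_similarity(learned, true_feats))
```

A score near 1.0 indicates that every ground-truth direction has a close match in the learned dictionary; such a metric only becomes usable on realistic models once ground-truth features can be obtained there, which is the gap the paper claims SAGE addresses.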
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11231