Keywords: Sparse Autoencoders, Dictionary Learning, Language Model Features, Scaling Laws, Mechanistic Interpretability
TL;DR: We study SAE scaling by decomposing SAE error into a component that is linearly predictable from model activations and a component that is not.
Abstract: Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, leaving "dark matter": unexplained variance in activations. In this work, we predict and verify that much of SAE dark matter can be linearly predicted from the activation vector. We exploit this fact to deconstruct dark matter into three top-level components: 1) unlearned linear features, 2) unlearned dense features, and 3) nonlinear errors introduced by the SAE. Through a scaling laws analysis, we estimate that nonlinear SAE error stays constant as SAEs scale and serves as a lower bound on SAE performance at both the average and per-token level. We then empirically analyze the nonlinear error term and show that it is not entirely a sparse sum of unlearned linear features, but that it is still responsible for some of the downstream reduction in cross-entropy loss when SAE reconstructions are inserted back into the model. Finally, we examine two methods for reducing nonlinear error: inference-time gradient pursuit, which yields only a slight decrease, and linear transformations from earlier-layer SAE dictionaries, which yield a larger reduction.
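A minimal sketch of the core decomposition the abstract describes, assuming model activations and SAE reconstructions are already in hand (all names, shapes, and the least-squares fit are illustrative assumptions, not the authors' actual code): fit a linear map from activations to the SAE error, then split the error into its linearly predictable part and the remaining nonlinear part.

```python
import torch

def split_dark_matter(acts: torch.Tensor, recon: torch.Tensor):
    """Split SAE error ("dark matter") into a part linearly predictable
    from the activations and a residual nonlinear part.

    acts:  [n_tokens, d_model] model activations (hypothetical shapes)
    recon: [n_tokens, d_model] SAE reconstructions of those activations
    """
    error = acts - recon  # total SAE error per token
    # Least-squares fit of the error from the activations (bias via a ones column).
    ones = torch.ones(acts.shape[0], 1, dtype=acts.dtype, device=acts.device)
    X = torch.cat([acts, ones], dim=1)
    W = torch.linalg.lstsq(X, error).solution   # [d_model + 1, d_model]
    linear_part = X @ W                         # linearly predictable error
    nonlinear_part = error - linear_part        # remaining nonlinear error
    # Fraction of error variance captured by the linear prediction.
    frac_linear = (linear_part.var(dim=0).sum() / error.var(dim=0).sum()).item()
    return linear_part, nonlinear_part, frac_linear

# Example usage with random stand-ins for real activations and reconstructions:
acts = torch.randn(4096, 768)
recon = acts + 0.1 * torch.randn(4096, 768)
_, _, frac = split_dark_matter(acts, recon)
print(f"fraction of SAE error variance linearly predictable: {frac:.2f}")
```

On real activations, a high `frac_linear` would indicate that much of the SAE error is linearly predictable, in line with the paper's claim; the residual `nonlinear_part` corresponds to the nonlinear error term the abstract analyzes.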
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3966