Keywords: mechanistic interpretability, sparse autoencoders, evaluations
TL;DR: We compute and validate *supervised* sparse feature dictionaries on the IOI task, and then compare SAEs against them
Abstract: Disentangling model activations into human-interpretable features is a central
problem in interpretability. Sparse autoencoders (SAEs) have recently attracted
much attention as a scalable unsupervised approach to this problem. However, our
imprecise understanding of ground-truth features in realistic scenarios makes it
difficult to measure the success of SAEs. To address this challenge, we propose
to evaluate SAEs on specific tasks by comparing them to supervised
feature dictionaries computed with knowledge of the concepts relevant to the
task.
Specifically, we suggest that it is possible to (1) compute supervised sparse
feature dictionaries that disentangle model computations for a specific task,
and (2) use them to evaluate and contextualize the degree of disentanglement
and control offered by SAE latents on this task. Importantly, we can do this
in a way that is agnostic to whether the SAEs have learned the exact
ground-truth features or a different but similarly useful representation.
As a case study, we apply this framework to the indirect object identification
(IOI) task using GPT-2 Small, with SAEs trained on either the IOI or OpenWebText
datasets. We find that SAEs capture interpretable features for the IOI task, and
that more recent SAE variants such as Gated SAEs and Top-K SAEs are competitive
with supervised features in terms of disentanglement and control over the model.
We also use this setup, together with toy models, to exhibit qualitative
phenomena in SAE training, illustrating feature splitting and the role of
feature magnitudes in the solutions SAEs prefer.
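To make the notion of a supervised feature dictionary concrete, here is a minimal illustrative sketch (not the submission's exact method): feature directions are built as mean activation differences conditioned on task attributes (e.g., names and positions in IOI). The names `acts`, `attrs`, and the mean-difference construction are assumptions for illustration only.

```python
# Illustrative sketch, not the authors' exact construction: a supervised
# feature dictionary from mean activations conditioned on task attributes.
import numpy as np

def supervised_dictionary(acts: np.ndarray, attrs: dict):
    """acts: (n, d) activations; attrs: attribute name -> (n,) array of labels."""
    mu = acts.mean(axis=0)                       # global mean activation
    features = {}                                # (attribute, value) -> direction
    for name, labels in attrs.items():
        for v in np.unique(labels):
            mask = labels == v
            # Feature direction: conditional mean minus global mean.
            features[(name, v)] = acts[mask].mean(axis=0) - mu
    return mu, features

def reconstruct(mu, features, attr_values: dict):
    """Approximate one activation as the global mean plus its attribute features."""
    return mu + sum(features[(name, v)] for name, v in attr_values.items())
```

Under this sketch, an activation is sparsely decomposed into a handful of attribute-specific directions, which is the kind of supervised baseline the SAE latents are compared against.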
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12831