Keywords: mechanistic interpretability, sparse autoencoders
TL;DR: We propose a principled methodology for assessing the feature decompositions learned by sparse autoencoders
Abstract: A major open problem in mechanistic interpretability is disentangling internal
model activations into meaningful features, with recent work focusing on sparse
autoencoders (SAEs) as a potential solution. However, verifying that an SAE has
found the `right' features in realistic settings has been difficult, as we don't
know the (hypothetical) ground-truth features to begin with. In the absence of
such ground truth, current evaluation metrics are indirect and rely on proxies,
toy models, or other non-trivial assumptions.
To overcome this, we propose a new framework to evaluate SAEs: studying how
pre-trained language models perform specific tasks, where model activations can
be disentangled in a supervised, principled way that allows precise control
and interpretability. We develop a task-specific comparison of learned SAEs to
our supervised feature decompositions that is \emph{agnostic} to whether the
SAE learned exactly the same set of features as our supervised method. We
instantiate this framework in the indirect object identification (IOI) task on
GPT-2 Small, and report on both successes and failures of SAEs in this setting.
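To make the kind of task-specific, basis-agnostic comparison described above concrete, below is a minimal sketch, not the paper's actual metric. It asks how much of the activation variance lying in the span of a supervised feature dictionary an SAE's reconstructions preserve, which does not require the SAE to learn the same individual features. All data, dimensions, and the untrained SAE are hypothetical stand-ins for IOI activations from GPT-2 Small.

```python
# Hedged sketch: basis-agnostic comparison of an SAE reconstruction against a
# supervised feature subspace. Synthetic stand-ins replace real IOI activations.
import torch

torch.manual_seed(0)
n_prompts, d_model, n_sup, n_sae = 256, 64, 8, 128

# Hypothetical supervised feature directions (e.g., name/position features in IOI)
# and activations built from them plus task-irrelevant noise.
U = torch.linalg.qr(torch.randn(d_model, n_sup)).Q            # orthonormal supervised dictionary
codes = torch.randn(n_prompts, n_sup)
acts = codes @ U.T + 0.1 * torch.randn(n_prompts, d_model)    # stand-in activations

# An (untrained) SAE: ReLU encoder plus linear decoder; in practice this would be
# an SAE trained on the same activations.
enc = torch.nn.Linear(d_model, n_sae)
dec = torch.nn.Linear(n_sae, d_model)
recon = dec(torch.relu(enc(acts)))

def frac_variance_in_subspace(x, x_hat, basis):
    """Fraction of the variance of x's projection onto `basis` preserved by x_hat."""
    p, p_hat = x @ basis, x_hat @ basis        # project onto supervised directions
    resid = p - p_hat
    return 1 - resid.pow(2).sum() / p.pow(2).sum()

score = frac_variance_in_subspace(acts, recon.detach(), U).item()
print(f"variance in supervised subspace preserved by SAE reconstruction: {score:.3f}")
```

A score near 1 would indicate the SAE retains the task-relevant directions regardless of how it carves them into individual features; this is only one illustrative choice of comparison, not necessarily the one used in the paper.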
Submission Number: 106