Keywords: Sparse autoencoders
TL;DR: We improve upon prior work to show that SAEs can perform on par with supervised features for model steering on IOI.
Abstract: Sparse autoencoders (SAEs) have attracted attention as a way towards unsupervised disentangling of hidden LLM activations into meaningful features. However, evaluations of SAE architectures and training algorithms have so far been indirect due to the difficulty, both conceptual and technical, of obtaining 'ground truth' features to compare against. To overcome this, recent work (Makelov et al., 2024) has proposed a suite of SAE evaluations that compare SAE features against feature dictionaries learned with supervision for a specific model capability. However, those evaluations were implemented in a mostly exploratory way and did not optimize for eliciting the best SAE performance across different SAE variants.
While initial SAE results are promising, they rely on qualitative and/or indirect evaluation of the learned features, such as proxies for the 'true' features, non-trivial assumptions about SAE learning, or success in toy models (Elhage et al., 2022; Bricken et al., 2023; Sharkey et al., 2023). As a step towards more objective SAE evaluations, Makelov et al. (2024) recently proposed to use sparse feature dictionaries learned with supervision in the context of a given model capability (specifically, the IOI task (Wang et al., 2023)) as a 'skyline' for achievable SAE performance w.r.t. this capability. They developed several evaluations that (1) confirm the supervised features provide a high-quality decomposition of model computations w.r.t. the capability and (2) use these supervised features to contextualize SAE results, for SAEs trained on either capability-specific or internet-text distributions.
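As a rough illustration of how a supervised feature dictionary supports such editing, here is a minimal sketch; the attribute names, the `edit_activation` helper, and the random placeholder vectors are hypothetical, not the paper's actual API or learned features. To change an attribute of the input, one subtracts the feature vector encoding the current attribute value from the hidden activation and adds the vector encoding the target value.

```python
import torch

d_model = 512  # illustrative hidden size, not tied to any particular model

# Hypothetical supervised dictionary: one feature vector per (attribute, value)
# pair, which in the actual setup would be learned with supervision on the
# IOI distribution. Random vectors stand in for learned features here.
feature_dict = {
    ("io_name", "Mary"): torch.randn(d_model),
    ("io_name", "John"): torch.randn(d_model),
}

def edit_activation(act, attr, old_value, new_value, feats=feature_dict):
    """Swap one attribute of a hidden activation: remove the feature encoding
    the current value and insert the feature encoding the target value."""
    return act - feats[(attr, old_value)] + feats[(attr, new_value)]

act = torch.randn(d_model)  # a hidden activation at some site of interest
edited = edit_activation(act, "io_name", "Mary", "John")
```

The same arithmetic applies whether the feature vectors come from the supervised dictionary or from an SAE's (rescaled) decoder directions, which is how the supervised features can serve as a skyline for SAE-based editing.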
In this work, we improve upon these evaluations by running a systematic and thorough study of using SAEs for steering on the IOI task, comparing several recently proposed SAE variants: 'vanilla' SAEs (Bricken et al., 2023), gated SAEs (Rajamanoharan et al., 2024), and topK SAEs (Gao et al., 2024). We find that, even with a simple and cheap heuristic for choosing good SAEs for editing, we greatly improve upon the results of prior work and demonstrate that SAE features can perform on par with supervised feature dictionaries. Further, we find that topK SAEs and gated SAEs generally outperform other variants on this test, and that topK SAEs can almost match supervised features in terms of edit quality.
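For concreteness, here is a minimal PyTorch sketch of the three SAE variants compared above. The dimensions are arbitrary, the pre-encoder subtraction of the decoder bias is omitted, and the gate and magnitude weights of the gated SAE are left untied, so this is a simplified rendering of the cited architectures rather than their exact parameterizations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_sae, k = 512, 4096, 32  # illustrative sizes


class VanillaSAE(nn.Module):
    """'Vanilla' SAE (Bricken et al., 2023): ReLU encoder; sparsity comes
    from an L1 penalty on the feature activations at training time."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = F.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f


class TopKSAE(nn.Module):
    """topK SAE (Gao et al., 2024): sparsity is enforced directly by keeping
    only the k largest pre-activations per input and zeroing the rest."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        pre = self.enc(x)
        top = torch.topk(pre, k, dim=-1)
        f = torch.zeros_like(pre).scatter_(-1, top.indices, F.relu(top.values))
        return self.dec(f), f


class GatedSAE(nn.Module):
    """Gated SAE (Rajamanoharan et al., 2024): a binary gate decides which
    features fire, while a separate magnitude path sets their values (the
    paper ties the two paths' weights via a learned rescaling and trains
    the gate with an auxiliary loss; both are omitted here)."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(d_model, d_sae)
        self.mag = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        g = (self.gate(x) > 0).float()  # Heaviside gate (inference-time view)
        f = g * F.relu(self.mag(x))
        return self.dec(f), f


x = torch.randn(8, d_model)
for sae in (VanillaSAE(), TopKSAE(), GatedSAE()):
    recon, feats = sae(x)  # reconstruction and sparse feature activations
```

Steering with any of these variants can then reuse the same feature arithmetic sketched earlier, with (scaled) decoder directions playing the role of the supervised feature vectors.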
Submission Number: 141