Keywords: Sparse autoencoders
TL;DR: We improve upon prior work to show that SAEs can perform on par with supervised features for model steering on IOI.
Abstract: Sparse autoencoders (SAEs) have attracted attention as a way towards unsupervised disentangling of hidden LLM activations into meaningful features. However, evaluations of SAE architectures and training algorithms have so far been indirect due to the difficulty, both conceptual and technical, of obtaining 'ground truth' features to compare against. To overcome this, recent work (Makelov et al., 2024) has proposed a suite of SAE evaluations that compare SAE features against feature dictionaries learned with supervision for a specific model capability. However, those evaluations were implemented in a mostly exploratory way and did not optimize for eliciting the best SAE performance across different SAE variants.
While initial SAE results are promising, they rely on qualitative and/or indirect evaluation of the learned features, such as proxies for the 'true' features, non-trivial assumptions about SAE learning, or success in toy models (Elhage et al., 2022; Bricken et al., 2023; Sharkey et al., 2023). As a step towards more objective SAE evaluations, Makelov et al. (2024) recently proposed to use sparse feature dictionaries learned with supervision in the context of a given model capability (specifically, the IOI task (Wang et al., 2023)) as a 'skyline' for achievable SAE performance w.r.t. this capability. They developed several evaluations that (1) confirm the supervised features provide a high-quality decomposition of model computations w.r.t. the capability and (2) use these supervised features to contextualize SAE results, for SAEs trained on either capability-specific or internet-text distributions.
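As a rough illustration of how a supervised feature dictionary supports such editing, here is a minimal sketch; the attribute names, the `edit_activation` helper, and the random placeholder vectors are hypothetical, not the paper's actual API or learned features. To change an attribute of the input, one subtracts the feature vector encoding the current attribute value from the hidden activation and adds the vector encoding the target value.

```python
import torch

d_model = 512  # illustrative hidden size, not tied to any particular model

# Hypothetical supervised dictionary: one feature vector per (attribute, value)
# pair, which in the actual setup would be learned with supervision on the
# IOI distribution. Random vectors stand in for learned features here.
feature_dict = {
    ("io_name", "Mary"): torch.randn(d_model),
    ("io_name", "John"): torch.randn(d_model),
}

def edit_activation(act, attr, old_value, new_value, feats=feature_dict):
    """Swap one attribute of a hidden activation: remove the feature encoding
    the current value and insert the feature encoding the target value."""
    return act - feats[(attr, old_value)] + feats[(attr, new_value)]

act = torch.randn(d_model)  # a hidden activation at some site of interest
edited = edit_activation(act, "io_name", "Mary", "John")
```

The same arithmetic applies whether the feature vectors come from the supervised dictionary or from an SAE's (rescaled) decoder directions, which is how the supervised features can serve as a skyline for SAE-based editing.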
In this work, we improve upon these evaluations by running a systematic and thorough study of using SAEs for steering on the IOI task, comparing several recently proposed SAE variants: 'vanilla' SAEs (Bricken et al., 2023), gated SAEs (Rajamanoharan et al., 2024), and topK SAEs (Gao et al., 2024). We find that, even with a simple and cheap heuristic for choosing good SAEs for editing, we greatly improve upon the results of prior work and demonstrate that SAE features can perform on par with supervised feature dictionaries. Further, we find that topK SAEs and gated SAEs generally outperform other variants on this test, and that topK SAEs can almost match supervised features in terms of edit quality.
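For concreteness, here is a minimal PyTorch sketch of the three SAE variants compared above. The dimensions are arbitrary, the pre-encoder subtraction of the decoder bias is omitted, and the gate and magnitude weights of the gated SAE are left untied, so this is a simplified rendering of the cited architectures rather than their exact parameterizations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_sae, k = 512, 4096, 32  # illustrative sizes


class VanillaSAE(nn.Module):
    """'Vanilla' SAE (Bricken et al., 2023): ReLU encoder; sparsity comes
    from an L1 penalty on the feature activations at training time."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = F.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f


class TopKSAE(nn.Module):
    """topK SAE (Gao et al., 2024): sparsity is enforced directly by keeping
    only the k largest pre-activations per input and zeroing the rest."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        pre = self.enc(x)
        top = torch.topk(pre, k, dim=-1)
        f = torch.zeros_like(pre).scatter_(-1, top.indices, F.relu(top.values))
        return self.dec(f), f


class GatedSAE(nn.Module):
    """Gated SAE (Rajamanoharan et al., 2024): a binary gate decides which
    features fire, while a separate magnitude path sets their values (the
    paper ties the two paths' weights via a learned rescaling and trains
    the gate with an auxiliary loss; both are omitted here)."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(d_model, d_sae)
        self.mag = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        g = (self.gate(x) > 0).float()  # Heaviside gate (inference-time view)
        f = g * F.relu(self.mag(x))
        return self.dec(f), f


x = torch.randn(8, d_model)
for sae in (VanillaSAE(), TopKSAE(), GatedSAE()):
    recon, feats = sae(x)  # reconstruction and sparse feature activations
```

Steering with any of these variants can then reuse the same feature arithmetic sketched earlier, with (scaled) decoder directions playing the role of the supervised feature vectors.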
Submission Number: 141