Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Published: 04 Mar 2024, Last Modified: 14 Apr 2024, SeT LLM @ ICLR 2024, CC BY 4.0
Keywords: mechanistic interpretability, sparse autoencoders
TL;DR: We propose a principled methodology to assess feature decompositions of sparse autoencoders
Abstract: A major open problem in mechanistic interpretability is disentangling internal model activations into meaningful features, with recent work focusing on sparse autoencoders (SAEs) as a potential solution. However, verifying that an SAE has found the "right" features in realistic settings is difficult, because we do not know the (hypothetical) ground-truth features to begin with. In the absence of such ground truth, current evaluation metrics are indirect and rely on proxies, toy models, or other non-trivial assumptions. To overcome this, we propose a new framework for evaluating SAEs: studying how pre-trained language models perform specific tasks, where model activations can be disentangled in a supervised, principled way that allows precise control and interpretability. We develop a task-specific comparison of learned SAEs to our supervised feature decompositions that is agnostic to whether the SAE learned the exact same set of features as our supervised method. We instantiate this framework in the indirect object identification (IOI) task on GPT-2 Small, and report on both successes and failures of SAEs in this setting.
Submission Number: 106