BatchTopK Sparse Autoencoders

Published: 10 Oct 2024, Last Modified: 09 Nov 2024, SciForDL Poster, CC BY 4.0
TL;DR: We introduce BatchTopK, a novel sparse autoencoder architecture that outperforms TopK SAEs.
Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch level, allowing a variable number of latents to be active per sample. BatchTopK SAEs consistently outperform TopK SAEs at reconstructing activations from GPT-2 Small and Gemma 2 2B. BatchTopK SAEs achieve reconstruction performance comparable to the state-of-the-art JumpReLU SAE, but have the advantage that the average number of active latents can be specified directly, rather than tuned approximately through a costly hyperparameter sweep. We provide code for training and evaluating these BatchTopK SAEs at [redacted].
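To illustrate the batch-level relaxation described in the abstract, here is a minimal sketch of a BatchTopK-style activation in PyTorch: instead of keeping the k largest pre-activations per sample, it keeps the k × batch_size largest pre-activations across the whole batch. Function and variable names are illustrative and not taken from the paper's released code.

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Batch-level top-k sparsification (illustrative sketch).

    pre_acts: (batch_size, n_latents) non-negative encoder pre-activations.
    k: target *average* number of active latents per sample.

    Keeps the k * batch_size largest pre-activations across the entire
    batch, so individual samples may use more or fewer than k latents
    while the batch-wide average stays at k.
    """
    batch_size = pre_acts.shape[0]
    flat = pre_acts.flatten()
    # Indices of the k * batch_size largest values over the whole batch.
    _, top_idx = flat.topk(k * batch_size)
    mask = torch.zeros_like(flat)
    mask[top_idx] = 1.0
    return (flat * mask).reshape(pre_acts.shape)
```

For comparison, a per-sample TopK SAE would instead call `pre_acts.topk(k, dim=-1)` and zero everything else, forcing exactly k active latents for every sample.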
Submission Number: 22
