BatchTopK Sparse Autoencoders

Published: 10 Oct 2024, Last Modified: 09 Nov 2024, SciForDL Poster, CC BY 4.0
TL;DR: We introduce BatchTopK, a novel sparse autoencoder architecture that outperforms TopK SAEs.
Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting language model activations by decomposing them into sparse, interpretable features. A popular approach is the TopK SAE, which uses a fixed number of the most active latents per sample to reconstruct the model activations. We introduce BatchTopK SAEs, a training method that improves upon TopK SAEs by relaxing the top-k constraint to the batch level, allowing a variable number of latents to be active per sample. BatchTopK SAEs consistently outperform TopK SAEs at reconstructing activations from GPT-2 Small and Gemma 2 2B. BatchTopK SAEs achieve reconstruction performance comparable to the state-of-the-art JumpReLU SAE, but have the advantage that the average number of active latents can be specified directly, rather than tuned approximately through a costly hyperparameter sweep. We provide code for training and evaluating these BatchTopK SAEs at [redacted].
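To illustrate the batch-level relaxation described in the abstract, here is a minimal sketch of a BatchTopK-style activation in PyTorch: instead of keeping the k largest pre-activations per sample, it keeps the k × batch_size largest pre-activations across the whole batch. Function and variable names are illustrative and not taken from the paper's released code.

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Batch-level top-k sparsification (illustrative sketch).

    pre_acts: (batch_size, n_latents) non-negative encoder pre-activations.
    k: target *average* number of active latents per sample.

    Keeps the k * batch_size largest pre-activations across the entire
    batch, so individual samples may use more or fewer than k latents
    while the batch-wide average stays at k.
    """
    batch_size = pre_acts.shape[0]
    flat = pre_acts.flatten()
    # Indices of the k * batch_size largest values over the whole batch.
    _, top_idx = flat.topk(k * batch_size)
    mask = torch.zeros_like(flat)
    mask[top_idx] = 1.0
    return (flat * mask).reshape(pre_acts.shape)
```

For comparison, a per-sample TopK SAE would instead call `pre_acts.topk(k, dim=-1)` and zero everything else, forcing exactly k active latents for every sample.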
Submission Number: 22
