SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: SAEBench, a comprehensive suite of benchmarks for sparse autoencoders (SAEs) in language model interpretability
Abstract: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement, and practical applications such as unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across seven recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface lets researchers flexibly visualize relationships between metrics across hundreds of open-source SAEs at www.neuronpedia.org/sae-bench.
Lay Summary: Researchers often struggle to understand how large language models, like GPT, internally represent language. Sparse autoencoders (SAEs) are tools designed to help make sense of these internal representations by identifying meaningful patterns ("features") within the model's activations. Recently, several new SAE designs have emerged, each claiming different strengths. However, comparing these methods is tricky because they are usually evaluated on simplified metrics whose practical value is unclear. We introduce SAEBench, a new, comprehensive benchmark that evaluates SAEs across eight diverse metrics, covering interpretability, disentanglement (clearly separating different features), and real-world tasks like selectively removing information ("unlearning"). We open-source more than 200 SAEs spanning seven prominent architectures to enable systematic comparisons. Interestingly, we find that approaches that seem weaker under traditional simplified evaluations, such as the Matryoshka SAE, actually perform substantially better on our broader suite of practical metrics: in particular, Matryoshka SAEs excel at clearly disentangling different features, a crucial capability that improves further as SAEs scale. SAEBench thus helps researchers understand SAE strengths and limitations in realistic scenarios, driving meaningful progress in interpretability research.
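For readers less familiar with the setup, the sketch below illustrates the basic SAE recipe the abstract and lay summary refer to: an encoder that maps a model activation to a sparse feature vector, a decoder that reconstructs the activation, and two of the standard unsupervised proxy metrics (L0 sparsity and fraction of variance unexplained). This is a minimal sketch with hypothetical dimensions and coefficients, not one of the architectures benchmarked in the paper.

```python
# Minimal illustrative SAE sketch plus two common unsupervised proxy metrics.
# Not a SAEBench architecture; dimensions and l1_coeff are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)  # activation -> feature space
        self.decoder = nn.Linear(d_sae, d_model)  # feature space -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstructed model activation
        return x_hat, f


def training_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()


def proxy_metrics(x, x_hat, f):
    # L0: average number of active features per input (lower = sparser).
    l0 = (f > 0).float().sum(dim=-1).mean().item()
    # FVU: fraction of variance unexplained by the reconstruction (lower = better).
    fvu = ((x - x_hat) ** 2).sum() / ((x - x.mean(dim=0)) ** 2).sum()
    return l0, fvu.item()


# Example on random stand-in "activations" (batch of 32, hidden size 768).
sae = SparseAutoencoder(d_model=768, d_sae=16384)
x = torch.randn(32, 768)
x_hat, f = sae(x)
print(training_loss(x, x_hat, f).item(), proxy_metrics(x, x_hat, f))
```

The paper's central point is that metrics like L0 and FVU above are proxies: improving them does not guarantee better interpretability, feature disentanglement, or downstream performance, which is what the eight SAEBench metrics are designed to measure directly.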
Link To Code: https://github.com/adamkarvonen/SAEBench
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Sparse autoencoders, mechanistic interpretability, evaluations
Submission Number: 8955