Keywords: sparse autoencoder, mechanistic interpretability, explainable ai
TL;DR: We show that the features of sparse autoencoders often exhibit a bimodal distribution, which can be exploited to achieve state-of-the-art autoencoders for model interpretability.
Abstract: Sparse autoencoders (SAEs) are a widely used method for decomposing LLM activations into a dictionary of interpretable features. We observe that this dictionary often exhibits a bimodal distribution, which can be leveraged to categorize features into two groups: those that are monosemantic and those that are artifacts of SAE training. The cluster of non-interpretable or polysemantic features undermines the purpose of sparse autoencoders and represents a waste of potential, akin to dead features. This phenomenon is prevalent across autoencoders using both ReLU and alternative activation functions. We propose a novel training method to address this issue and demonstrate that it achieves improved results on several benchmarks from SAEBench.
Primary Area: interpretability and explainable AI
Submission Number: 9227