Keywords: Sparse Autoencoders
TL;DR: Sparse autoencoder features exhibit a bimodal distribution, which we leverage to propose an alternative training method
Abstract: Sparse autoencoders (SAEs) are a widely used method for decomposing LLM activations into a dictionary of interpretable features. We observe that this dictionary often exhibits a bimodal distribution, which can be leveraged to categorize features into two groups: those that are monosemantic and those that are artifacts of SAE training. The cluster of uninterpretable or polysemantic features undermines the purpose of sparse autoencoders and, much like dead features, represents wasted capacity. This phenomenon is prevalent across autoencoders using both ReLU and alternative activation functions. We propose a novel training method to address this issue and demonstrate that it achieves improved results on several benchmarks from SAEBench.
Submission Number: 116