Keywords: Sparse Autoencoders, Mechanistic Interpretability, Identifiability, Sparse Dictionary Learning
TL;DR: We show a dissociation between linear and support identifiability in SAEs: different sparse coding algorithms find substantially different but equally valid feature sets, suggesting that mechanistic narratives based on a single encoder solution may be incomplete.
Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability, but it is unclear whether the encoder's sparse code is uniquely determined. We compare SAE encoders against classical sparse coding algorithms (OMP, IHT) using frozen dictionaries. We find that alternative methods select substantially different features (Jaccard ∼ 0.43) while producing linearly equivalent codes (R² > 0.88). This dissociation between linear and support identifiability holds across layers and SAE configurations. Our results suggest SAE features represent one valid decomposition among alternatives, with implications for interpretability claims built on specific features.
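To make the two metrics in the abstract concrete, the following is a minimal sketch of how support overlap (Jaccard) and linear equivalence (R²) between two sparse codes over the same frozen dictionary could be measured. Everything here is an assumption for illustration, not the submission's pipeline: the dictionary and data are synthetic, `topk_encoder` is a hypothetical stand-in for a trained SAE encoder, `omp` is a plain greedy implementation, and the held-out linear regression is only one plausible reading of "linearly equivalent".

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit over the columns (atoms) of D."""
    residual, support, coef = x.copy(), [], np.zeros(0)
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never select the same atom twice
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z = np.zeros(D.shape[1])
    z[support] = coef
    return z

def topk_encoder(D, x, k):
    """Hypothetical SAE-like encoder: ReLU of the k largest pre-activations."""
    pre = D.T @ x
    idx = np.argsort(pre)[-k:]
    z = np.zeros(D.shape[1])
    z[idx] = np.maximum(pre[idx], 0.0)
    return z

def jaccard(z1, z2):
    """Jaccard index between the supports (nonzero positions) of two codes."""
    a, b = set(np.flatnonzero(z1)), set(np.flatnonzero(z2))
    return len(a & b) / max(len(a | b), 1)

def linear_r2(Za, Zb, n_train):
    """Held-out R^2 of a least-squares linear map from codes Za to codes Zb."""
    W, *_ = np.linalg.lstsq(Za[:n_train], Zb[:n_train], rcond=None)
    target = Zb[n_train:]
    pred = Za[n_train:] @ W
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic setup (assumed sizes): 32-dim inputs, 64 unit-norm atoms, 4 active.
rng = np.random.default_rng(0)
d, m, k, n = 32, 64, 4, 400
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)

Z_true = np.zeros((n, m))
for i in range(n):
    active = rng.choice(m, size=k, replace=False)
    Z_true[i, active] = rng.uniform(0.5, 1.5, size=k)
X = Z_true @ D.T

# Encode every sample with both methods against the same frozen dictionary.
Z_omp = np.array([omp(D, x, k) for x in X])
Z_enc = np.array([topk_encoder(D, x, k) for x in X])

mean_jaccard = float(np.mean([jaccard(a, b) for a, b in zip(Z_enc, Z_omp)]))
r2 = float(linear_r2(Z_enc, Z_omp, n_train=n // 2))
print(f"mean support Jaccard: {mean_jaccard:.3f}, held-out linear R^2: {r2:.3f}")
```

The numbers this toy setup prints are not comparable to the values reported in the abstract; the sketch only shows that the two quantities are computed independently, so a low support overlap and a high linear R² can coexist.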
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 79