Which Sparse Code? Identifiability Failures in SAE Inference

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 | Sci4DL 2026 | Readers: Everyone | CC BY 4.0
Keywords: Sparse Autoencoders, Mechanistic Interpretability, Identifiability, Sparse Dictionary Learning
TL;DR: We show a dissociation between linear and support identifiability in SAEs: different sparse coding algorithms select substantially different but equally valid feature sets, suggesting that mechanistic narratives based on a single encoder solution may be incomplete.
Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability, but it is unclear whether the encoder’s sparse code is uniquely determined. We compare SAE encoders against classical sparse coding algorithms (OMP, IHT) using frozen dictionaries. We find that alternative methods select substantially different features (Jaccard ≈ 0.43) while producing linearly equivalent codes (R² > 0.88). This dissociation between linear and support identifiability holds across layers and SAE configurations. Our results suggest SAE features represent one valid decomposition among alternatives, with implications for interpretability claims built on specific features.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 79
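
The comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the submission's code: the dictionary is random rather than a trained SAE decoder, the inputs are synthetic, the SAE encoder itself is omitted (only OMP and IHT are compared with each other), and `linear_r2` is one plausible way to score linear equivalence between two codes; the paper's exact metric may differ.

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal matching pursuit on a frozen dictionary D of shape (d, m)."""
    residual, support, coef = x.copy(), [], np.zeros(0)
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never reselect an atom
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z = np.zeros(D.shape[1])
    z[support] = coef
    return z

def iht(D, x, k, n_iter=100):
    """Iterative hard thresholding with step size 1 / ||D||_2^2."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = z + step * D.T @ (x - D @ z)   # gradient step on the residual
        z[np.argsort(np.abs(z))[:-k]] = 0  # keep only the k largest entries
    return z

def jaccard(a, b):
    """Jaccard overlap between the supports (nonzero indices) of two codes."""
    sa, sb = set(np.flatnonzero(a)), set(np.flatnonzero(b))
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def linear_r2(A, B):
    """In-sample R^2 of the best linear map from codes A (n, m) to codes B (n, m)."""
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    ss_res = np.sum((B - A @ W) ** 2)
    ss_tot = np.sum((B - B.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical setup: random unit-norm dictionary and synthetic k-sparse inputs.
rng = np.random.default_rng(0)
d, m, n, k = 32, 128, 200, 8
D = rng.normal(size=(d, m))
D /= np.linalg.norm(D, axis=0)
Z_true = np.zeros((n, m))
for row in Z_true:
    row[rng.choice(m, size=k, replace=False)] = rng.normal(size=k)
X = Z_true @ D.T

Z_omp = np.array([omp(D, x, k) for x in X])
Z_iht = np.array([iht(D, x, k) for x in X])
mean_jaccard = np.mean([jaccard(a, b) for a, b in zip(Z_omp, Z_iht)])
print("mean support Jaccard:", mean_jaccard)
print("linear R^2 (OMP -> IHT):", linear_r2(Z_omp, Z_iht))
```

To reproduce the setting in the abstract, `D` would be replaced by a trained SAE's frozen decoder and one of the two code matrices by the SAE encoder's output on model activations.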