Enforcing Orderedness in SAEs to Improve Feature Consistency

Published: 30 Sept 2025, Last Modified: 30 Sept 2025
Venue: Mech Interp Workshop (NeurIPS 2025) Poster
License: CC BY 4.0
Keywords: Sparse Autoencoders, Foundational work
Other Keywords: Matryoshka SAEs, Hierarchical Models, Ordered Autoencoders, Consistent Feature Learning
TL;DR: OSAE deterministically orders and uses all latent features, theoretically resolving permutation non-identifiability in settings where sparse dictionary learning is otherwise identifiable, and empirically improving feature consistency over Matryoshka SAEs on toy and language data.
Abstract: Sparse autoencoders (SAEs) are widely used to interpret neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically, on Gemma2-2B and Pythia-70M, we show that OSAEs can improve consistency compared to Matryoshka baselines.
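For intuition, the following is a minimal PyTorch sketch of what a deterministic, all-prefix ordered objective could look like. The class name `OrderedSAE` and all shapes and hyperparameters are illustrative assumptions, not the authors' implementation; the sparsity penalty is omitted.

```python
import torch
import torch.nn as nn

class OrderedSAE(nn.Module):
    """Hypothetical sketch of an ordered SAE objective (not the paper's code).

    A Matryoshka SAE trains reconstructions from a sampled subset of latent
    prefixes; the ordered variant described in the abstract instead averages
    the reconstruction loss over every prefix length m = 1..d_latent, making
    earlier latents strictly more important and fixing a feature ordering.
    """

    def __init__(self, d_model: int, d_latent: int) -> None:
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model, bias=False)

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.encoder(x))  # sparse codes, shape (B, d_latent)
        # Per-latent contributions to the reconstruction: (B, d_latent, d_model).
        contrib = z.unsqueeze(-1) * self.decoder.weight.T.unsqueeze(0)
        # Prefix reconstructions: cum_recon[:, m-1] decodes only the first m latents.
        cum_recon = contrib.cumsum(dim=1)
        # Deterministic ordered objective: reconstruction error averaged over
        # ALL prefixes (no sampling); a sparsity penalty would be added here.
        return ((cum_recon - x.unsqueeze(1)) ** 2).mean()

# Usage:
sae = OrderedSAE(d_model=64, d_latent=256)
loss = sae.loss(torch.randn(8, 64))
loss.backward()
```

The cumulative-sum formulation evaluates every prefix in one pass, which is what removes the sampling-based approximation; its memory cost grows with d_latent, so a practical implementation may chunk the prefix dimension.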
Submission Number: 271