Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 Poster
License: CC BY 4.0
Keywords: Concept Discovery (e.g., SAEs, dictionary learning)
Other Keywords: SAE, Expander-SAE, Expander graphs, Compressed Sensing
TL;DR: Expander SAEs use expander-supported decoder columns to reduce SAE decoder storage from O(mn) to O(dn) with d<<n, yielding up to 293× fewer learned decoder values while retaining substantial CE-loss recovery.
Abstract: Sparse autoencoders (SAEs) decompose internal activations of neural networks into sparse linear combinations of learned features by fitting an overcomplete dictionary $\mathbf{W}\in\mathbb{R}^{m\times n}$ with $m<n$, and inferring a sparse code $\mathbf{x}\in\mathbb{R}^n$ from $\mathbf{h}\approx\mathbf{W}\mathbf{x}$. This inference problem closely resembles the canonical setup of compressed sensing, but a dense dictionary requires $\mathcal{O}(mn)$ learned decoder values, which becomes costly at large feature counts. We introduce Expander SAEs: TopK SAEs whose decoder and tied encoder are supported on a left-$d$-regular expander mask with $d \ll n$, learning only $\mathcal{O}(dn)$ decoder values while keeping the sparse-coding problem $(m,n,k)$ fixed. The same structure reduces storage and turns the matching-pursuit correlation step $\mathbf{W}^\top \mathbf{r}$ in OMP into an $\mathcal{O}(dn)$ gather-and-reduce operation. Our experiments show that varying $d$ traces a consistent storage--fidelity frontier across Pythia-160M, Qwen2.5-3B, and Llama-3.2-1B residual-stream activations, and that at $d=7$ the Qwen2.5-3B Expander SAE uses $293\times$ fewer learned decoder values than the full dense decoder while retaining $84$\% of the dense SAE's CE-loss recovered. Support-structure controls show that column sparsity explains much of the storage--fidelity tradeoff, while the diversity of column supports avoids the dead-feature pathologies of clustered sparse masks. Additional ablations show that budget-matched reduced-width dense SAEs remain a strong trained-encoder baseline at modern scale, but applying the same iterative OMP decoder to both architectures substantially narrows the small-budget gap, exposing an encoder-amortisation component.
On the theoretical side, we prove a weighted-expander identifiability theorem showing that if the fixed mask expands every $2k$-feature subset and the learned decoder columns remain sufficiently flat on their supports, then every noiseless $k$-sparse code has a unique $k$-sparse explanation that classical compressed-sensing decoders recover exactly. Expander SAEs therefore offer a parameter-efficient and theory-motivated dictionary for large-scale mechanistic interpretability.
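To make the storage and gather-and-reduce claims concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the mask is stored as an index array `idx` of shape $(n, d)$ with matching values `vals`; these names, the random left-$d$-regular mask (a stand-in for a true expander), the toy sizes, and the ReLU in `topk_encode` are all illustrative assumptions. The final line checks that $m/d$ matches the reported $293\times$ if one assumes $m = 2048$ for Qwen2.5-3B.

```python
import numpy as np

def random_left_regular_mask(m, n, d, rng):
    """Give each of the n columns d distinct row indices in [0, m).
    A random left-d-regular graph; assumed stand-in for an expander mask."""
    return np.stack([rng.choice(m, size=d, replace=False) for _ in range(n)])

def correlate(idx, vals, r):
    """Compute W^T r by gathering d entries of r per column: O(dn) work."""
    return (vals * r[idx]).sum(axis=1)

def decode(idx, vals, x, m):
    """Compute h = W x by scatter-adding only the active columns."""
    h = np.zeros(m)
    active = np.flatnonzero(x)
    np.add.at(h, idx[active].ravel(), (vals[active] * x[active, None]).ravel())
    return h

def topk_encode(idx, vals, h, k):
    """Tied-encoder TopK: keep the k largest correlations (ReLU is assumed)."""
    scores = correlate(idx, vals, h)
    keep = np.argpartition(scores, -k)[-k:]
    x = np.zeros_like(scores)
    x[keep] = np.maximum(scores[keep], 0.0)
    return x

rng = np.random.default_rng(0)
m, n, d, k = 64, 512, 7, 8            # toy sizes, not the paper's
idx = random_left_regular_mask(m, n, d, rng)
vals = rng.standard_normal((n, d))    # the only learned decoder values: d * n

# Dense reference decoder, built only to check the sparse routines.
W = np.zeros((m, n))
W[idx, np.arange(n)[:, None]] = vals

r = rng.standard_normal(m)
x = topk_encode(idx, vals, r, k)
assert np.allclose(correlate(idx, vals, r), W.T @ r)
assert np.allclose(decode(idx, vals, x, m), W @ x)

# Storage ratio is (m * n) / (d * n) = m / d, independent of n.
ratio_qwen = 2048 / 7                 # assumes m = 2048 for Qwen2.5-3B
print(round(ratio_qwen))              # 293
```

The dense matrix `W` exists here only as a correctness reference; in the sparse representation the stored parameters are the $d \cdot n$ entries of `vals`, with `idx` fixed rather than learned.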
Submission Number: 188