Keywords: Sparse Autoencoders, Dictionary Learning, Interpretability
TL;DR: We show that Sparse Autoencoders (SAEs) are inherently biased toward detecting only a subset of the concepts in model activations, a subset shaped by each architecture's internal assumptions, highlighting the need for concept-geometry-aware design of novel SAE architectures.
Abstract: Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations.
We show that each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shape what it can and cannot detect. We train SAEs on synthetic data with specific structure to show that SAEs fail to recover concepts when the data violates these assumptions, and we design a new SAE, called SpaDE, that enables the discovery of previously hidden concepts (those with heterogeneous intrinsic dimensionality and nonlinear separation boundaries) and reinforces our theoretical insights.
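To make the "structural assumptions" concrete, below is a minimal sketch of a standard ReLU sparse autoencoder of the kind commonly trained on model activations. It bakes in the assumptions the abstract refers to: concepts are linear dictionary directions, and their codes are non-negative and sparse (via an L1 penalty). This is a generic baseline for illustration, not the paper's SpaDE architecture; all sizes and coefficients are placeholders.

```python
# Minimal "vanilla" sparse autoencoder sketch (generic baseline, not SpaDE).
# Assumption baked in: activations are sparse, non-negative combinations of
# linear dictionary directions. Hyperparameters below are illustrative only.
import torch
import torch.nn as nn


class VanillaSAE(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # linear read-out of concepts
        self.decoder = nn.Linear(d_dict, d_model)   # columns = concept directions

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))          # non-negative, sparse codes
        recon = self.decoder(codes)                  # linear reconstruction
        return recon, codes


def sae_loss(x, recon, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the codes.
    return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()


if __name__ == "__main__":
    x = torch.randn(32, 512)                         # batch of model activations
    sae = VanillaSAE(d_model=512, d_dict=2048)
    recon, codes = sae(x)
    loss = sae_loss(x, recon, codes)
    loss.backward()
    print(float(loss))
```

Concepts whose geometry departs from these assumptions, e.g. with heterogeneous intrinsic dimensionality or nonlinear separation boundaries, are exactly the cases such an architecture is not guaranteed to recover.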
Code: zip
Submission Number: 81