Keywords: Concept Discovery (e.g., SAEs, dictionary learning)
Other Keywords: Identifiability, Theory
TL;DR: We study optimality constraints for dictionary learning solutions, and use them to understand the behaviour of representations in Sparse Autoencoders.
Abstract: Sparse Autoencoders (SAEs) have found widespread success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly an SAE extracts, and, correspondingly, what scientific conclusions we can draw, is not obvious. Empirically, the proof is in the pudding: SAEs do learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There is an extensive body of work studying sparse coding identifiability: given data generated under sparsity assumptions, when will an algorithm recover the true factors? However, SAEs are trained on internet-swallowing representations that are poorly approximated by simple generative models. Rather than assuming a hypothesised ground truth, we ask what properties any dictionary learning optimum must satisfy, without assumptions on the data. Concretely, we extend existing local optimality analyses to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these to explain a range of observed SAE behaviours (hierarchical splitting and absorption, the structure of residuals, and dense antipodal features), each reflecting how the L1 penalty and nonnegativity interact with data to structure optimal dictionaries. Further, we identify a novel convex formulation of the problem, and use it to ask: will larger SAEs ever stop splitting? We find the answer can be yes, with a limiting dictionary state that clusters data along rays. In sum, we hope this framework can tease model assumptions from unexpected observations, letting us learn more from SAEs' successes.
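As a rough illustration of the kind of objective the abstract refers to, the sketch below writes down a generic nonnegative, L1-penalised sparse coding loss and one proximal-gradient update on the codes. Everything here is an assumption made for illustration: the names `objective`, `nonneg_ista_step`, and `demo`, the penalty weight `lam`, the unit-norm dictionary rows, and the random data are not taken from the paper, and the paper's exact formulation may differ. Only the code update is shown; the joint problem would also optimise the dictionary `D`.

```python
import numpy as np


def objective(X, D, Z, lam):
    """0.5 * ||X - Z D||_F^2 + lam * sum(Z), intended for codes Z >= 0."""
    residual = X - Z @ D
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(Z)


def nonneg_ista_step(X, D, Z, lam, step):
    """One proximal-gradient step on the codes Z with the dictionary D fixed.

    For Z >= 0 the L1 penalty is linear, so the proximal map is a shift by
    step * lam followed by clipping at zero.
    """
    grad = (Z @ D - X) @ D.T
    return np.maximum(Z - step * (grad + lam), 0.0)


def demo(seed=0, n=50, d=8, k=16, lam=0.1, iters=200):
    """Run the code update on random data (hypothetical sizes and values)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))               # stand-in for activations
    D = rng.normal(size=(k, d))               # dictionary, one atom per row
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    Z = np.zeros((n, k))
    # Step size 1/L, where L is the largest eigenvalue of D D^T.
    step = 1.0 / np.linalg.norm(D @ D.T, 2)
    history = [objective(X, D, Z, lam)]
    for _ in range(iters):
        Z = nonneg_ista_step(X, D, Z, lam, step)
        history.append(objective(X, D, Z, lam))
    return Z, history
```

Calling `demo()` returns the nonnegative codes and the objective value after each step; with the step size chosen as above, the objective should not increase from one iteration to the next.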
Submission Number: 438