TL;DR: We introduce Matryoshka SAEs to mitigate feature splitting and absorption in large SAEs. By training nested dictionaries to enforce a feature hierarchy, our method improves feature disentanglement and performance on downstream interpretability tasks.
Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e., the number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
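The nested-dictionary objective described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary sizes, the TopK sparsity rule, and the unweighted sum of per-prefix errors are all assumptions made for the example; the core idea shown is that each prefix of the dictionary must reconstruct the input on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16                   # hypothetical activation dimension
dict_sizes = [8, 32, 128]      # hypothetical nested dictionary sizes
k = 4                          # hypothetical number of active latents per prefix

# A single shared encoder/decoder; prefixes of increasing size
# form the nested ("Matryoshka") dictionaries.
W_enc = rng.normal(size=(d_model, dict_sizes[-1])) / np.sqrt(d_model)
W_dec = rng.normal(size=(dict_sizes[-1], d_model)) / np.sqrt(dict_sizes[-1])

def topk_relu(z, k):
    """Keep the k largest positive activations, zero the rest."""
    z = np.maximum(z, 0.0)
    out = z.copy()
    out[np.argsort(z)[:-k]] = 0.0
    return out

def matryoshka_loss(x):
    """Sum of reconstruction errors over all prefix dictionaries.

    Each smaller dictionary must reconstruct x independently, without
    using the latents that only exist in the larger dictionaries.
    """
    z = x @ W_enc
    total = 0.0
    for m in dict_sizes:
        z_m = topk_relu(z[:m], k)          # latents available to this prefix
        x_hat = z_m @ W_dec[:m]            # reconstruct with the prefix decoder
        total += np.sum((x - x_hat) ** 2)  # per-prefix reconstruction error
    return total

x = rng.normal(size=d_model)
loss = matryoshka_loss(x)
```

Because every prefix is penalized for its own reconstruction error, the earliest latents are pushed toward broadly useful, general features, while later latents are free to specialize.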
Lay Summary: Modern AI systems are often black boxes: we don't know what concepts they have learned internally. Researchers use tools called sparse autoencoders to peek inside and extract interpretable concepts, but these tools have a major flaw: when we try to capture more detailed concepts, they lose track of broader categories. For example, a system might learn about specific punctuation marks like periods and commas but forget the general concept of "punctuation marks" altogether.
We developed Matryoshka Sparse Autoencoders, named after Russian nesting dolls. Our system trains several sparse autoencoders at once, nested inside each other with levels of increasing detail. The smaller, core "dolls" are trained to capture the most general, high-level concepts. Then, the larger, surrounding "dolls" learn more specific details, building upon the general concepts without overwriting or damaging them.
This layered approach allows Matryoshka Sparse Autoencoders to represent both the forest (big-picture ideas) and the trees (fine-grained details) effectively. As a result, researchers can get a more complete and interpretable view of how an AI model understands information at multiple levels of abstraction. This ultimately helps researchers better understand how powerful AI systems work, which may be crucial for ensuring that they are safe.
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: sparse autoencoders, mechanistic interpretability
Submission Number: 7724