TL;DR: This work trains sparse autoencoders over dense document-level text embeddings and studies their scaling laws, interpretability, feature relationships, and geometric structure.
Abstract: Sparse autoencoders (SAEs) show promise in extracting interpretable features from complex neural networks, enabling examination of and causal intervention in the inner workings of black-box models. However, the geometry and completeness of SAE features are not fully understood, limiting their interpretability and usefulness. In this work, we train SAEs to disentangle dense text embeddings into highly interpretable document-level features. Our SAEs follow precise scaling laws as a function of capacity and compute, and exhibit higher interpretability scores compared to SAEs trained on language model activations. In embedding SAEs, we reproduce the qualitative "feature splitting" phenomena previously reported in language model SAEs, and demonstrate the existence of universal, cross-domain features. Finally, we suggest the existence of "feature families" in SAEs, and develop a method that reveals distinct hierarchical clusters of related semantic concepts and maps feature co-activations to a sparse block-diagonal structure.
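For intuition, a minimal sketch of the core setup described in the abstract, a sparse autoencoder trained to reconstruct dense document embeddings, might look like the following. The dimensions, the top-k sparsity mechanism, and the pre-bias are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch (assumed architecture, not the authors' code): a top-k sparse
# autoencoder over document-level text embeddings.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_embed: int = 768, d_features: int = 16384, k: int = 32):
        super().__init__()
        self.k = k                                        # active features per document (assumed)
        self.encoder = nn.Linear(d_embed, d_features)
        self.decoder = nn.Linear(d_features, d_embed, bias=False)
        self.pre_bias = nn.Parameter(torch.zeros(d_embed))

    def forward(self, x: torch.Tensor):
        # Encode, then keep only the k largest activations to enforce sparsity.
        acts = torch.relu(self.encoder(x - self.pre_bias))
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        # Decode the sparse feature vector back into the embedding space.
        recon = self.decoder(sparse) + self.pre_bias
        return recon, sparse


# Usage: reconstruct a batch of (placeholder) document embeddings.
sae = SparseAutoencoder()
embeddings = torch.randn(8, 768)
recon, features = sae(embeddings)
loss = torch.nn.functional.mse_loss(recon, embeddings)
```

The sparse activation vector returned alongside the reconstruction is what the abstract refers to as document-level features; its dimensionality (capacity) and the training budget are the quantities over which scaling laws would be measured.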
Style Files: I have used the style files.
Submission Number: 3