Keywords: formal languages, sparse autoencoders, interpretability
TL;DR: We formulate a synthetic testbed to stress-test the sparse autoencoder (SAE) approach to interpretability in the text domain, using formal languages.
Abstract: Sparse autoencoders (SAEs) have been central to the effort of finding interpretable and disentangled directions in the representation spaces of neural networks, in both the image and text domains. While the efficacy and pitfalls of this method in the vision domain are well-studied, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We define and train language models on a set of formal grammars, and train SAEs on the latent representations of these models under a wide variety of hyperparameter settings. We identify several interpretable latents in the SAEs, and formulate a scaling law relating the reconstruction loss of SAEs to their hidden size. We show empirically that the presence of latents correlating with certain features of the input does not imply that those latents play a causal role in the model's computation, and that the performance of SAEs is highly sensitive to inductive biases.
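For concreteness, below is a minimal sketch of the kind of SAE setup the abstract describes: a ReLU encoder over model activations with an L1 sparsity penalty, trained to reconstruct those activations. This is the standard formulation from the interpretability literature, not necessarily the paper's exact architecture; the dimensions and the l1_coeff value are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary: d_hidden > d_model, with a ReLU nonlinearity."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        latents = F.relu(self.encoder(x))   # sparse latent code
        recon = self.decoder(latents)       # reconstruction of the activation
        return recon, latents

def sae_loss(x, recon, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse latents.
    return F.mse_loss(recon, x) + l1_coeff * latents.abs().mean()

# Usage: x stands in for a batch of language-model activations; the
# hidden size d_hidden is the quantity swept when studying scaling behavior.
sae = SparseAutoencoder(d_model=128, d_hidden=1024)
x = torch.randn(32, 128)
recon, latents = sae(x)
loss = sae_loss(x, recon, latents)
loss.backward()
```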
Email Of Author Nominated As Reviewer: abhinav.m@research.iiit.ac.in
Submission Number: 17