TL;DR: We propose Jacobian sparse autoencoders (JSAEs), an architecture for discovering sparse computational graphs in a fully unsupervised way.
Abstract: Sparse autoencoders (SAEs) have been successfully used to discover sparse and human-interpretable representations of the latent activations of large language models (LLMs). However, we would ultimately like to understand the computations performed by LLMs, not just their representations. The extent to which SAEs can help us understand computations is unclear because they are not designed to “sparsify” computations in any sense, only latent activations. To solve this, we propose Jacobian sparse autoencoders (JSAEs), which yield not only sparsity in the input and output activations of a given model component but also sparsity in the computation (formally, the Jacobian) connecting them. With a naïve implementation, the Jacobians in LLMs would be computationally intractable due to their size. Our key technical contribution is thus finding an efficient way of computing Jacobians in this setup. We find that JSAEs extract a relatively large degree of computational sparsity while preserving downstream LLM performance approximately as well as traditional SAEs. We also show that JSAEs achieve a greater degree of computational sparsity on pre-trained LLMs than on the equivalent randomized LLMs. This indicates that the sparsity of the computational graph is a property that LLMs learn through training, and suggests that JSAEs might be more suitable than standard SAEs for understanding learned transformer computations.
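To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; the `SAE` class, the toy dimensions, and the 0.1 penalty weight are illustrative assumptions): a pair of SAEs is placed around an MLP, and an L1 penalty is added on the Jacobian of the map from input-SAE latents, through the MLP, to output-SAE latents. The sketch uses the naïve per-example Jacobian, i.e. the intractable baseline the abstract refers to, not the paper's efficient computation.

```python
# Hypothetical sketch of a Jacobian sparsity penalty around an MLP (toy scale).
# Not the authors' implementation: names, sizes, and the 0.1 weight are
# illustrative, and the naive Jacobian below is what the paper describes as
# intractable for real LLMs.
import torch
import torch.nn as nn

d_model, d_sae = 64, 256  # toy dimensions; real models are far larger

class SAE(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, z):
        return self.dec(z)

mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))
sae_in, sae_out = SAE(d_model, d_sae), SAE(d_model, d_sae)

def latents_out(z_in):
    # Input-SAE latents -> reconstructed activation -> MLP -> output-SAE latents.
    return sae_out.encode(mlp(sae_in.decode(z_in)))

x = torch.randn(8, d_model)   # a batch of activations entering the MLP
y = mlp(x)                    # the MLP's output activations
z_in = sae_in.encode(x)

# Naive per-example Jacobian, shape (batch, d_sae, d_sae); create_graph=True
# lets the sparsity penalty be backpropagated into the SAE parameters
# (the pretrained MLP would typically stay frozen and not be updated).
jac = torch.stack([
    torch.autograd.functional.jacobian(latents_out, z, create_graph=True)
    for z in z_in
])

recon_in = ((sae_in.decode(z_in) - x) ** 2).mean()
recon_out = ((sae_out.decode(sae_out.encode(y)) - y) ** 2).mean()
jac_sparsity = jac.abs().mean()   # encourage a sparse computational graph

loss = recon_in + recon_out + 0.1 * jac_sparsity
loss.backward()
```

At LLM scale the stacked Jacobian above is far too large to materialize, which is why the paper's efficient Jacobian computation is the key technical contribution; this sketch only illustrates the training objective.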
Lay Summary: Modern language models are powerful, but their internal “wiring” is so densely connected that humans struggle to see how one idea influences another. Existing sparse autoencoders (tools that compress the model’s signals into simpler pieces) tidy up those signals, yet the hidden wiring itself remains tangled. We introduce Jacobian Sparse Autoencoders (JSAEs), a new method that encourages most of the hidden connections to fade out, spotlighting only the truly important ones. The trick is to make the Jacobian (the mathematical map linking each input feature to each output feature) mostly zeros, and we show how to compute this map efficiently even inside very large models. On several GPT-2-style networks, JSAEs significantly cut the number of active connections while keeping language performance virtually unchanged. When we applied the same procedure to a randomly re-initialised copy of each model, the wiring stayed dense, suggesting that sparsity is something real models learn, not a fluke of our method. Because the surviving connections form a small, readable circuit, researchers can more easily trace cause and effect inside the model. This clarity is a step toward safer, more transparent AI systems and provides a new foundation for fine-grained audits of model behaviour.
Link To Code: https://github.com/lucyfarnik/jacobian-saes
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: Mechanistic Interpretability, Sparse Autoencoders, Superposition, Circuits, AI Safety
Submission Number: 10383