Keywords: Circuit learning; Mechanistic interpretability; Sparse autoencoders
TL;DR: We present a scalable circuit learning method for large language models that efficiently uncovers relationships among components and SAE features, enhances interpretability, and improves efficiency for downstream tasks.
Abstract: A prominent research direction within mechanistic interpretability involves learning sparse circuits to model causal relationships between LLM components, thereby providing insights into model behavior. However, due to the polysemantic nature of LLM components, learned circuits are often difficult to interpret. While sparse autoencoder (SAE) features enhance interpretability, their high dimensionality poses a significant scalability challenge for existing circuit learning methods. To address these limitations, we propose a scalable circuit learning approach, CircuitLasso, that leverages sparse linear regression. Our method efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. We empirically evaluate our method against state-of-the-art baselines on benchmark circuit learning tasks, demonstrating substantial improvements in efficiency while accurately capturing circuits involving LLM components. Given its efficiency, we then apply our method to high-dimensional SAE features and obtain human-interpretable circuits for a grammatical classification task that has not previously been studied in mechanistic interpretability. Finally, we validate the utility of our learned circuits by leveraging their insights to improve downstream performance in domain generalization.
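Note: The abstract does not specify implementation details, but the following is a minimal sketch of the general idea of circuit discovery via sparse linear regression: for each downstream SAE feature, fit an L1-regularized (Lasso) regression on upstream SAE feature activations and treat nonzero coefficients as candidate circuit edges. All names, array shapes, and the regularization strength below are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch (assumed, not the paper's implementation) of learning
# sparse edges between SAE features with Lasso regression.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_tokens, n_upstream, n_downstream = 1024, 512, 64  # hypothetical sizes

# Placeholder activations; in practice these would be SAE feature activations
# collected from adjacent layers of an LLM on a task dataset.
upstream_acts = rng.standard_normal((n_tokens, n_upstream))
downstream_acts = rng.standard_normal((n_tokens, n_downstream))

edges = {}
for j in range(n_downstream):
    # The L1 penalty encourages a sparse set of upstream parents for feature j.
    model = Lasso(alpha=0.1, max_iter=5000)
    model.fit(upstream_acts, downstream_acts[:, j])
    parents = np.nonzero(model.coef_)[0]
    edges[j] = [(int(i), float(model.coef_[i])) for i in parents]

print(f"downstream feature 0 has {len(edges[0])} candidate upstream edges")
```

Because each downstream feature is fit independently, this formulation parallelizes trivially across features, which is one plausible reason a regression-based approach scales to high-dimensional SAE dictionaries.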
Primary Area: interpretability and explainable AI
Submission Number: 9942