The parameters in weight-sparse transformers are interpretable

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Virtual poster | Everyone | CC BY 4.0
Keywords: Automated interpretability, Benchmarking Interpretability, Methods (probing, steering, causal interventions)
Other Keywords: Weight-based interpretability, weight-sparse transformers
TL;DR: We empirically show that a large fraction of nonzero weights in weight-sparse transformers admit short, human-readable explanations of when they matter, a far larger fraction than in dense transformers.
Abstract: A central goal of mechanistic interpretability is to understand how neural networks work and what each individual component does. Dominant circuit-finding approaches focus on a specific behavior and reverse-engineer the role of components on the associated sub-distribution. Past work has shown, however, that components can have different functions that are active on different subsets of the input distribution. In this work, we test whether it is possible to understand individual weights globally, on the full training distribution. We focus on weight-sparse transformers, in which we expect individual weights to be more interpretable than in dense models. We introduce an automated LLM pipeline that produces a short, human-readable account of when a given weight matters and verifies this account on held-out data, and we apply it at scale to compare two weight-sparse transformers against two dense models. Empirically, we find that a substantial percentage of nonzero weights in sparse transformers are interpretable (17-35%), compared to 5-9% in dense models. Our results are a proof of concept that a substantial fraction of language model weights can be interpretable, and they confirm that the weights of sparse models are more interpretable than those of dense models.
Submission Number: 538