Keywords: Applications of interpretability, Interpretability tooling and software
Other Keywords: parameter decomposition
TL;DR: We extend recent Parameter Decomposition work to Transformers and show its applicability on a new toy model and GPT-2.
Abstract: Recent work in mechanistic interpretability has proposed decomposing model parameters rather than activations.
We extend Stochastic Parameter Decomposition (SPD) to Transformer models, proposing an updated causal importance function suited for sequential data.
We demonstrate that SPD can successfully decompose a toy induction-head model and recover the underlying computations.
We also show that applying SPD to GPT-2-small can successfully locate subcomponents corresponding to interpretable concepts like "golf" and "basketball".
This work takes a first step toward extending SPD to modern models, and shows that the method can surface interpretable parameter-space mechanisms.
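The abstract does not specify how the updated causal importance function handles sequential data; the sketch below is one plausible reading, not the paper's implementation. In SPD-style decompositions, a causal importance function predicts, for each datapoint, how important each parameter subcomponent is, and a stochastic mask is sampled between that importance and 1. For sequences, the natural extension is to compute importances per token position. All module and function names here are illustrative assumptions.

```python
# Hypothetical sketch: a per-token causal importance function for SPD-style
# subcomponent masking on sequential data. Names are illustrative, not the
# paper's actual implementation.
import torch
import torch.nn as nn

class TokenwiseCausalImportance(nn.Module):
    """Maps each token's activation to a [0, 1] importance per subcomponent."""
    def __init__(self, d_model: int, n_subcomponents: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, n_subcomponents),
        )

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (batch, seq, d_model) -> importances in [0, 1],
        # shape (batch, seq, n_subcomponents)
        return torch.sigmoid(self.mlp(acts))

def stochastic_mask(g: torch.Tensor) -> torch.Tensor:
    # Sample a mask uniformly between the predicted importance g and 1:
    # subcomponents deemed unimportant (g near 0) are heavily ablated,
    # while important ones (g near 1) are left nearly intact.
    u = torch.rand_like(g)
    return g + (1.0 - g) * u

# Usage: per-token masks for 128 rank-one subcomponents of one weight matrix.
imp = TokenwiseCausalImportance(d_model=768, n_subcomponents=128)
acts = torch.randn(2, 16, 768)        # (batch, seq, d_model)
mask = stochastic_mask(imp(acts))     # (2, 16, 128)
```

The key design choice this sketch highlights is that importance is a function of each position's activation rather than of the whole input, which is what "suited for sequential data" most plausibly requires.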
Submission Number: 205