Keywords: Circuit Analysis, Attribution Graphs, Methods (probing, steering, causal interventions)
TL;DR: Targeted Parameter Decomposition (tPD) recovers faithful, editable weight-space mechanisms for a chosen subset of inputs at a fraction of full-data PD's compute cost.
Abstract: Parameter decomposition (PD) decomposes neural networks into interpretable computational components that faithfully reflect the original network's operations. However, scaling PD to large models requires vast compute, making it a costly and risky endeavor. Here we propose targeted PD (tPD), which identifies only the components that process specific inputs of interest – from isolated prompts to large subtasks – by introducing a high-rank catch-all component that handles all non-target data. We validate tPD on toy models and on transformer language models trained on The Pile, where it recovers reproducible, mechanistically faithful circuits. We extract a CSS-only submodel of a 4-block transformer using $\approx$7\% of the FLOPs of its published decomposition, and in a 12-block transformer we surgically ablate and rewire memorized sequences, with negligible side effects on other inputs.
Submission Number: 396