Targeted Recovery of Weight-Space Mechanisms From Neural Networks

Antoine Vigouroux; Lee Sharkey

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Mech Interp Workshop ICML 2026 Virtual poster, CC BY 4.0

Keywords: Circuit Analysis, Attribution Graphs, Methods (probing, steering, causal interventions)

TL;DR: Targeted Parameter Decomposition (tPD) recovers faithful, editable weight-space mechanisms for a chosen subset of inputs at a fraction of full-data PD's compute cost.

Abstract: Parameter decomposition (PD) decomposes neural networks into interpretable computational components that faithfully reflect the original network's operations. However, scaling PD to large models requires vast compute, making it a costly and risky endeavor. Here we propose targeted PD (tPD), which identifies only the components that process specific inputs of interest – from isolated prompts to large subtasks – by introducing a high-rank catch-all component that handles all non-target data. We validate tPD on toy models and on transformer language models trained on The Pile, where it recovers reproducible, mechanistically faithful circuits. We extract a CSS-only submodel of a 4-block transformer using ≈7% of the FLOPs of its published decomposition, and in a 12-block transformer we surgically ablate and rewire memorized sequences, with negligible side effects on other inputs.

Submission Number: 396
