Keywords: Mechanistic Interpretability, Pruning, Science of Deep Learning, AI Safety
TL;DR: We identify the common workflow of mechanistic interpretability research and automate its “systematic ablations” step with a new algorithm, ACDC.
Abstract: Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of
transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers
choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which
abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under
investigation, researchers can understand the functionality of each component.
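To make the activation-patching step concrete, here is a minimal sketch in PyTorch. The helper name `activation_patch` and the `metric` callable are hypothetical, not from the paper; it assumes the chosen module outputs a single tensor and that clean and corrupted inputs share a shape:

```python
import torch

def activation_patch(model, module, clean_input, corrupt_input, metric):
    # Hypothetical helper: cache `module`'s activation on the corrupted
    # input, patch it into a run on the clean input, and return the
    # resulting change in the metric.
    cache = {}

    def save_hook(mod, inp, out):
        cache["act"] = out.detach()

    def patch_hook(mod, inp, out):
        # Returning a tensor from a forward hook replaces the output.
        return cache["act"]

    # Baseline: metric on the unmodified clean run.
    with torch.no_grad():
        clean_score = metric(model(clean_input))

    # Cache the corrupted activation at the unit under investigation.
    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(corrupt_input)
    handle.remove()

    # Re-run on the clean input with the corrupted activation patched in.
    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_score = metric(model(clean_input))
    handle.remove()

    # A large drop suggests the unit is involved in the behavior.
    return clean_score - patched_score
```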
We automate one of the process's steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For
example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the
Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by
previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery
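The sketch below illustrates the greedy edge-pruning idea behind ACDC under stated assumptions; it is not the paper's implementation (see the repository above for that). It assumes `graph` maps each node to its incoming edges in topological order, and that `kl_div(edges)` runs the model with only `edges` active (corrupted activations patched in elsewhere) and returns the KL divergence from the full model's output distribution:

```python
def acdc_sketch(graph, tau, kl_div):
    # Start from the full computational graph: every edge is kept.
    kept = {edge for node in graph for edge in graph[node]}

    # Visit nodes from output to input (reverse topological order,
    # assuming the dict's insertion order is topological).
    for node in reversed(list(graph)):
        for edge in list(graph[node]):
            # Ablate the edge; if the output distribution barely moves
            # (KL increase below the threshold tau), prune it for good.
            if kl_div(kept - {edge}) - kl_div(kept) < tau:
                kept = kept - {edge}

    # The surviving edges form the recovered circuit.
    return kept
```

The threshold `tau` trades off circuit size against faithfulness: smaller values keep more edges, larger values prune more aggressively.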
Submission Number: 14912