Research Area: Science of LMs
Keywords: Mechanistic Interpretability, Deep Learning
TL;DR: We introduce circuit probing, a novel technique that automatically uncovers low-level circuits that compute hypothesized intermediate variables.
Abstract: Neural network models have achieved high performance on a wide variety of complex tasks, but the algorithms that they implement are notoriously difficult to interpret. Understanding these algorithms often requires hypothesizing intermediate variables involved in a network's computation. For example, does a language model depend on particular syntactic properties when generating a sentence? Yet existing analysis tools make it difficult to test hypotheses of this type. We propose a new analysis technique – circuit probing – that automatically uncovers low-level circuits that compute hypothesized intermediate variables. This enables causal analysis through targeted ablation at the level of model parameters. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training. Across these three experiments, we demonstrate that circuit probing combines and extends the capabilities of existing methods, providing one unified approach for a variety of analyses. Finally, we demonstrate circuit probing on a real-world use case: uncovering circuits that are responsible for subject-verb agreement and reflexive anaphora in GPT-2 Small and GPT-2 Medium.
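To make the abstract's core idea concrete – learning which parameters form a circuit for a hypothesized intermediate variable, then ablating exactly those parameters – here is a minimal sketch. Everything in it is an illustrative assumption rather than the paper's actual method: the sigmoid weight-mask parameterization, the contrastive probe objective, the sparsity penalty, and all names (`MaskedLinear`, `probe_loss`, the synthetic data) are hypothetical stand-ins.

```python
# Hypothetical sketch of the circuit-probing idea described in the abstract:
# learn a soft binary mask over a frozen layer's weights so that the surviving
# subnetwork's activations separate inputs by a hypothesized intermediate
# variable, then ablate that subnetwork for a parameter-level causal test.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    """Frozen linear layer with a learnable weight mask (illustrative)."""

    def __init__(self, layer: nn.Linear):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad_(False)  # the underlying model stays fixed
        self.mask_logits = nn.Parameter(torch.zeros_like(layer.weight))

    def forward(self, x):
        mask = torch.sigmoid(self.mask_logits)  # soft mask in (0, 1)
        return F.linear(x, self.layer.weight * mask, self.layer.bias)

    def ablate(self, x):
        """Knock out the discovered circuit by zeroing the masked weights."""
        keep = (torch.sigmoid(self.mask_logits) <= 0.5).float()
        return F.linear(x, self.layer.weight * keep, self.layer.bias)


def probe_loss(acts, labels, margin=1.0):
    """Contrastive objective: pull together activations that share a value of
    the hypothesized variable; push apart those that differ, up to a margin."""
    dists = torch.cdist(acts, acts)
    same = (labels[:, None] == labels[None, :]).float()
    return (same * dists + (1.0 - same) * F.relu(margin - dists)).mean()


# Toy demonstration on synthetic data: a frozen random layer, with the first
# input coordinate standing in for a binary "hypothesized variable".
torch.manual_seed(0)
probe = MaskedLinear(nn.Linear(16, 16))
opt = torch.optim.Adam([probe.mask_logits], lr=1e-2)
for _ in range(200):
    x = torch.randn(64, 16)
    var_labels = (x[:, 0] > 0).long()  # hypothesized intermediate variable
    loss = probe_loss(probe(x), var_labels)
    loss = loss + 1e-3 * torch.sigmoid(probe.mask_logits).sum()  # sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of masking weights, rather than fitting a readout probe on activations, is that the discovered circuit lives in the parameters themselves, so `ablate` is a direct causal intervention on the model, matching the abstract's claim of "targeted ablation at the level of model parameters."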
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 418