Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: language model interpretability, interpretability, mechanistic interpretability, circuit analysis, activation patching, large language models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We find that varying the hyperparameters of activation patching can lead to different interpretability results and give recommendations for best practices.
Abstract: Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization (identifying the important model components) is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. Across several localization and circuit discovery settings in language models, we find that varying these hyperparameters can lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for best practices in activation patching going forward.
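For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of activation patching, not the paper's code: the toy model, the chosen layer, and the logit-based metric are illustrative assumptions. The core idea is to cache an activation from a clean run, overwrite the corresponding activation during a run on a corrupted input, and measure how much of the clean behaviour the patch restores under some evaluation metric.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer network standing in for a transformer component (assumption).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 3))

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)  # e.g., a noised or counterfactual prompt

# 1) Run on the clean input and cache the activation at the component of interest.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2) Run on the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # returned value replaces this module's output

handle = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

# Unpatched corrupted run, for reference.
corrupted_logits = model(corrupted_input)

# 3) Evaluation metric (one of the hyperparameters the paper studies): here a
# normalized logit-style score, where 0 means the patch has no effect and 1
# means it fully restores the clean output. The target class is hypothetical.
target = 0
def metric(logits):
    return logits[0, target].item()

recovered = (metric(patched_logits) - metric(corrupted_logits)) / (
    metric(clean_logits) - metric(corrupted_logits)
)
print(f"Fraction of clean behaviour recovered by the patch: {recovered:.2f}")
```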
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Primary Area: visualization or interpretation of learned representations
Submission Number: 2241