Discovering Variable Binding Circuitry with Desiderata

Xander Davies; Max Nadeau; Nikhil Prakash; Tamar Rott Shaham; David Bau

Published: 23 Jun 2023, Last Modified: 06 Jul 2023, Venue: DeployableGenerativeAI

Keywords: interpretability, NLP, transformers

TL;DR: We use causal intervention desiderata to automatically discover shared variable binding circuitry in LLaMA-13B.

Abstract: Recent work has shown that computation in language models may be human-understandable, with successful efforts to localize and intervene on both single-unit features and input-output circuits. Here, we introduce an approach that extends causal mediation experiments to automatically identify the model components responsible for a specific subtask, solely by specifying a set of $\textit{desiderata}$: causal attributes of the model components that execute that subtask. As a proof of concept, we apply our method to automatically discover shared variable binding circuitry in LLaMA-13B, which retrieves variable values for multiple arithmetic tasks. Our method successfully localizes variable binding to only 9 attention heads (of the 1.6k) and one MLP in the final token's residual stream.
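
Illustration: the idea of a causal desideratum can be sketched on a toy example. The sketch below is not the authors' method and does not touch LLaMA-13B. It assumes a hypothetical two-component "model" (run_toy_model, with components named "lookup" and "operand") and checks each component by brute force: patch that component's activation from a counterfactual run into the original run and test whether the output moves to the value the desideratum specifies. All names and numbers here are assumptions made for illustration.

```python
from typing import Dict, Optional

Acts = Dict[str, int]


def run_toy_model(prompt: dict, patch: Optional[Acts] = None) -> Acts:
    """Hypothetical toy model: output = value bound to the queried variable + operand.

    `patch` optionally overrides a component's activation (a causal intervention).
    """
    patch = patch or {}
    acts: Acts = {}
    # "lookup" stands in for a variable-binding component: it retrieves the value.
    acts["lookup"] = patch.get("lookup", prompt["bindings"][prompt["query"]])
    # "operand" stands in for a component unrelated to variable binding.
    acts["operand"] = patch.get("operand", prompt["operand"])
    acts["output"] = acts["lookup"] + acts["operand"]
    return acts


def satisfies_desideratum(component: str, original: dict,
                          counterfactual: dict, target_output: int) -> bool:
    """Patch `component` from the counterfactual run into the original run
    and check whether the output equals the desideratum's target."""
    donor = run_toy_model(counterfactual)
    patched = run_toy_model(original, patch={component: donor[component]})
    return patched["output"] == target_output


# Original and counterfactual prompts differ only in the value bound to x.
original = {"bindings": {"x": 3}, "query": "x", "operand": 10}
counterfactual = {"bindings": {"x": 7}, "query": "x", "operand": 10}

# Desideratum: a binding component, once patched, should carry the
# counterfactual value of x into the original computation (7 + 10).
target = 7 + 10

results = {name: satisfies_desideratum(name, original, counterfactual, target)
           for name in ("lookup", "operand")}
print(results)
```

In this toy setting only "lookup" satisfies the desideratum, because it is the only component whose activation carries the variable's value. The abstract describes identifying components automatically from a set of such desiderata; the exhaustive per-component check above is a simplification chosen for readability.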

Submission Number: 54
