Inducing Induction in Llama via Linear Probe Interventions

Published: 21 Sept 2024 · Last Modified: 06 Oct 2024 · BlackboxNLP 2024 · License: CC BY 4.0
Track: Extended abstract
Keywords: induction heads, interpretability, mechanistic interpretability, large language models
Abstract: For induction heads to copy forward information successfully, heads in earlier layers must first load previous token information into every hidden state, a process Olsson et al. (2022) call key shifting. While this information is hypothesized to exist, there have been few attempts to explicitly locate it in models. In this work, we use linear probes to identify the subspaces responsible for storing previous token information in Llama-2-7b and Llama-3-8b. We show that these subspaces are causally implicated in induction by using them to "edit" previous token information and trigger random token copying in new contexts.
Submission Number: 16
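
Below is a minimal sketch, not taken from the paper, of the kind of linear-probe setup the abstract describes: fitting a probe that predicts the *previous* token's identity from an early-layer hidden state of Llama-2-7b. The checkpoint name, layer index, and training data here are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch: train a linear probe to decode previous-token identity from an
# early-layer hidden state of Llama-2-7b. Layer, data, and hyperparameters
# are illustrative assumptions; device placement is omitted for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint
LAYER = 4                            # assumed early layer carrying previous-token info

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, output_hidden_states=True
).eval()

def collect_probe_data(texts):
    """Pair each position's hidden state with the token id one step back."""
    feats, labels = [], []
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids).hidden_states[LAYER][0]   # (seq, d_model)
        feats.append(hs[1:].float())                  # states at positions 1..n-1
        labels.append(ids[0, :-1])                    # previous-token targets
    return torch.cat(feats), torch.cat(labels)

# Linear probe: hidden state -> previous token id.
X, y = collect_probe_data(["The quick brown fox jumps over the lazy dog."])
probe = torch.nn.Linear(X.shape[1], tok.vocab_size)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(X), y)
    loss.backward()
    opt.step()
```

The causal "editing" the abstract refers to would then intervene on the subspace the probe identifies, for example by overwriting hidden states along the probe's directions in a new context and checking whether the model copies the corresponding random tokens; the specific intervention procedure is described in the paper itself, not reproduced here.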