Track: Extended abstract
Keywords: induction heads, interpretability, mechanistic interpretability, large language models
Abstract: For induction heads to copy forward information successfully, heads in earlier layers must first load previous token information into every hidden state, a process Olsson et al. (2022) call key shifting. While this information is hypothesized to exist, there have been few attempts to explicitly locate it in models. In this work, we use linear probes to identify the subspaces responsible for storing previous token information in Llama-2-7b and Llama-3-8b. We show that these subspaces are causally implicated in induction by using them to "edit" previous token information and trigger random token copying in new contexts.
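A minimal sketch of the kind of previous-token linear probe the abstract describes, assuming access to Llama-2-7b through Hugging Face transformers. The layer index, the random-token sequences, and the restricted probe vocabulary are illustrative assumptions, not the authors' exact setup; the probe labels each hidden state with the identity of the token one position earlier, so high accuracy indicates that previous-token information is linearly recoverable at that layer.

```python
# Sketch: probe hidden states for previous-token identity (assumptions noted above).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
LAYER = 4           # assumed early layer; the work asks where this information lives
N_SEQS = 200        # number of random-token sequences
SEQ_LEN = 32
VOCAB_SUBSET = 50   # restrict probe targets to a small random vocabulary

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Small pool of token ids used to build random sequences.
token_pool = torch.randint(1000, 20000, (VOCAB_SUBSET,))

features, labels = [], []
with torch.no_grad():
    for _ in range(N_SEQS):
        ids = token_pool[torch.randint(0, VOCAB_SUBSET, (SEQ_LEN,))]
        ids = ids.unsqueeze(0).to(model.device)
        out = model(ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0]  # (SEQ_LEN, d_model)
        # Label position t's hidden state with the token at position t-1:
        # if previous-token information is present, a linear probe recovers it.
        features.append(hidden[1:].float().cpu())
        labels.append(ids[0, :-1].cpu())

X = torch.cat(features).numpy()
y = torch.cat(labels).numpy()

split = int(0.8 * len(X))
probe = LogisticRegression(max_iter=1000)
probe.fit(X[:split], y[:split])
print("previous-token probe accuracy:", probe.score(X[split:], y[split:]))
```

The probe's weight matrix spans a candidate subspace; the causal test described in the abstract would then overwrite hidden-state components in that subspace and check whether induction heads copy the edited token instead of the original one.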
Submission Number: 16