Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

NeurIPS 2023 Workshop ATTRIB Submission 31

Published: 27 Oct 2023, Last Modified: 08 Dec 2023 (ATTRIB Poster)
Keywords: Mechanistic Interpretability, Natural Language Processing, Large Language Models
TL;DR: We show how activation patching can hallucinate meaningful subspaces in a language model by activating dormant pathways.
Abstract: Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features, often hypothesized to manifest as low-dimensional subspaces of activations. Specifically, recent studies have explored subspace interventions (such as activation patching) as a way to both manipulate model behavior and attribute the features behind it to given subspaces. In this work, we demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability. Counterintuitively, even if a subspace intervention modifies end-to-end model behavior in the desired way, this effect may be achieved by activating a dormant parallel pathway leveraging a component that is causally disconnected from model outputs. We demonstrate this phenomenon in a distilled mathematical example, in two real-world domains (the indirect object identification task and factual recall), and present evidence for its prevalence in practice. In the context of factual recall, we further show a link to rank-1 fact editing, providing a mechanistic explanation for previous work observing an inconsistency between fact editing performance and fact localization. Finally, we remark on what a success case of subspace activation patching looks like.
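To make the core operation concrete, here is a minimal numpy sketch of 1-D subspace activation patching as the abstract describes it: the component of a "clean" activation along a chosen direction is replaced by the corresponding component from a "corrupt" run, leaving the orthogonal complement untouched. The function name, toy vectors, and dimension are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def subspace_patch(a_clean, a_corrupt, v):
    """Patch a_clean along the 1-D subspace spanned by v.

    Replaces the component of a_clean along v with the component of
    a_corrupt along v; all directions orthogonal to v are preserved.
    """
    v = v / np.linalg.norm(v)  # normalize the patching direction
    delta = np.dot(a_corrupt - a_clean, v)  # difference of projections onto v
    return a_clean + delta * v

# Toy activations standing in for a model's hidden states on two prompts.
rng = np.random.default_rng(0)
a_clean = rng.normal(size=8)
a_corrupt = rng.normal(size=8)
v = rng.normal(size=8)

patched = subspace_patch(a_clean, a_corrupt, v)
```

The paper's point is that even when feeding `patched` back into the model changes outputs as intended, the chosen direction `v` need not carry the feature of interest: the intervention may instead activate a pathway that is dormant on natural inputs.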
Submission Number: 31