Observable Propagation: Uncovering Feature Vectors in Transformers

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: large language models, mechanistic interpretability, feature vectors
TL;DR: Using almost no data, you can find the features that transformer language models use in their computation, and understand how they cause gender bias in model outputs.
Abstract: A key goal of current mechanistic interpretability research in NLP is to find linear features (also called "feature vectors") for transformers: directions in activation space corresponding to concepts that are used by a given model in its computation. Present state-of-the-art methods for finding linear features require large amounts of labelled data -- both laborious to acquire and computationally expensive to utilize. In this work, we introduce a novel method, called "observable propagation" (in short: "ObsProp"), for finding linear features used by transformer language models in computing a given task -- using almost no data. Our paradigm centers on the concept of "observables", linear functionals corresponding to given tasks. We then introduce a mathematical theory for the analysis of feature vectors: we prove that LayerNorm nonlinearities in high dimensions do not affect the direction of feature vectors; we also introduce a similarity metric between feature vectors called the "coupling coefficient", and prove that it accurately estimates the degree to which one feature's output correlates with another's. Armed with these tools, we use observable propagation to investigate the features that cause gendered occupational bias in a large language model. In our experiments, we identify the specific features used by the model for predicting occupation based on a gendered name, and find that some of the same features are used by the model for predicting grammatical gender. Our results suggest that observable propagation can be used to better understand the mechanisms responsible for bias in large language models.
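The abstract names three concrete constructs: an "observable" as a linear functional on the logits, approximate direction-preservation of feature vectors under LayerNorm in high dimensions, and a "coupling coefficient" between feature vectors. The NumPy sketch below is a minimal illustration of all three under toy assumptions; the model shapes, token ids, random unembedding matrix, and the exact coupling-coefficient formula are illustrative guesses for exposition, not the paper's definitions.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 512, 1000

# An observable is a linear functional on the logits. A natural example for
# gendered-bias tasks is a logit difference between two tokens:
# n @ logits = logit(" he") - logit(" she").
tok_he, tok_she = 17, 23            # hypothetical token ids
n = np.zeros(d_vocab)
n[tok_he], n[tok_she] = 1.0, -1.0

# Pulling the observable back through a linear map: if logits = W_U @ x,
# then n @ logits = (W_U.T @ n) @ x, so W_U.T @ n is a feature vector in
# the residual stream. (Here W_U is random; in the real method it would be
# the model's unembedding, composed with attention/MLP paths.)
W_U = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)
feature = W_U.T @ n

# LayerNorm approximately preserves direction in high dimensions: for a
# typical high-dimensional x, the centering step barely rotates it.
def layernorm(x, eps=1e-5):
    xc = x - x.mean()
    return xc / np.sqrt(xc.var() + eps)

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
x = rng.normal(size=d_model)
print("cosine(x, LayerNorm(x)):", cos(x, layernorm(x)))   # close to 1.0

# One plausible form of a coupling coefficient between feature vectors
# f1, f2: the least-squares coefficient for predicting f2 @ x from f1 @ x
# when x is isotropic, which works out to (f1 @ f2) / (f1 @ f1).
f1 = feature
f2 = W_U.T @ (0.8 * n + 0.2 * rng.normal(size=d_vocab))
coupling = (f1 @ f2) / (f1 @ f1)
print("coupling coefficient:", coupling)

The point of the sketch is only the algebra of pulling a linear functional back to earlier activations; the paper's actual propagation through a trained transformer, and its precise definitions, should be taken from the paper itself.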
Primary Area: representation learning for computer vision, audio, language, and other modalities
Submission Number: 4345