Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Mechanistic Interpretability, Interpretability, Fact, Factual Recall, LLM, Explainability, Transparency
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We reverse engineer several independent mechanisms by which models perform the task of factual recall, and show that they combine in an additive manner, constructively interfering on correct answers.
Abstract: How do large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form \tokens{Fact: The Colosseum is in the country of}. We find that the mechanistic story behind factual recall is more complex than previously thought -- we show that there exist four distinct and independent mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the \textbf{additive motif}: models compute correct answers by adding together multiple independent contributions; each mechanism's contribution is insufficient alone, but when summed they constructively interfere on the correct attribute. In addition, we extend the method of direct logit attribution to attribute a head's output to individual source tokens. We use this technique to unpack what we call `mixed heads' -- heads whose output is itself the sum of two separate additive updates.
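As a rough sketch of the per-source-token extension of direct logit attribution described in the abstract (the notation below is an illustrative assumption, not the authors' exact formulation): since an attention head $h$'s output at destination position $t$ is a weighted sum of value vectors over source positions, its direct contribution to the logit of an answer token $a$ decomposes source token by source token,
$$\mathrm{DLA}_h(t, a) \;=\; \sum_{s \le t} \underbrace{A^{h}_{t,s}\,\big(x_s W_V^{h}\big) W_O^{h}\, W_U[:, a]}_{\text{contribution of source token } s},$$
where $A^{h}_{t,s}$ is the attention weight, $x_s$ the residual stream at source position $s$, $W_V^{h}, W_O^{h}$ the head's value and output matrices, and $W_U[:, a]$ the unembedding direction of $a$ (LayerNorm scaling omitted for simplicity). Each source token thus receives its own additive share of the head's effect on the answer logit.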
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9284