Keywords: analysis, interpretability, mechanistic interpretability, context vs prior knowledge, large language models
TL;DR: The tension between relying on in-context information and on prior knowledge when prompted is fundamental to LMs; we use mechanistic interpretability techniques to find a knob that controls this trade-off.
Abstract: When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge.
Controlling how sensitive the model is to its context is a fundamental capability, as it enables the model to excel at tasks like retrieval-augmented generation and question answering.
In this paper, we search for a knob that controls this sensitivity, determining whether language models answer from the context or from their prior knowledge.
To guide this search, we design a task for controllable context sensitivity.
In this task, we first feed the model a context ("Paris is in England") and a question ("Where is Paris?"); we then instruct the model to use either its prior or its contextual knowledge and evaluate whether it generates the correct answer for each intent ("France" or "England", respectively).
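For illustration, here is a minimal sketch of how one instance of this task might be assembled; the prompt wording, field names, and intent labels are hypothetical placeholders, not the paper's released templates.

```python
# Minimal sketch of building one controllable-context-sensitivity example.
# The template wording and the intent labels ("context" / "prior") are
# illustrative placeholders, not the paper's actual prompt format.
def build_example(context: str, question: str, intent: str) -> str:
    instruction = (
        "Answer using only the context above."
        if intent == "context"
        else "Ignore the context above and answer from your prior knowledge."
    )
    return f"Context: {context}\nQuestion: {question}\n{instruction}\nAnswer:"

prompt = build_example("Paris is in England.", "Where is Paris?", intent="prior")
# expected completion: "France"; with intent="context" it would be "England"
```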
When fine-tuned on this task, instruct versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%).
Analyzing these high-performing models, we use a novel linear-time algorithm to narrow down which layers may be important for context sensitivity.
Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge.
Interestingly, although we identify this subspace in a fine-tuned model, we find that the exact same subspace serves as an effective knob not only in that model but also in non-fine-tuned instruct and base models of the same model family.
Finally, we show a strong correlation between a model's performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace.
These results suggest that a single fundamental subspace mediates how the model chooses between context and prior knowledge.
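For intuition only, the following is a minimal sketch of how such a 1-D subspace could be used as a steering knob at inference time; the model, layer index, direction vector, and scale are placeholder assumptions, not artifacts released with this submission.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that shifts a layer's hidden states along `direction`.

    Hypothetically, `direction` is a unit vector spanning the identified 1-D
    subspace; positive `alpha` pushes the model toward context-following
    behavior and negative `alpha` toward its prior knowledge (the sign is a
    convention for this sketch, not a claim about any specific model).
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # shift every token position
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Example wiring for a Llama-style HuggingFace model (placeholder names):
# handle = model.model.layers[LAYER_IDX].register_forward_hook(
#     make_steering_hook(v_knob, alpha=8.0))
# ... run generation, then: handle.remove()
```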
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12441