Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion

ACL ARR 2024 June Submission 1566 Authors

14 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: Recent work in mechanistic interpretability of language models (LMs) has established that fact completion is mediated by localized computations. However, these findings rest on the assumption that, as long as the model is accurate, the same computations occur for all predictions, and results are aggregated accordingly. Meanwhile, a parallel body of work has shown that accurate fact completions can result from various inference processes, including predictions based on superficial properties of the query or even pure guesswork. In this paper, we present a taxonomy of relevant prediction mechanisms and observe that a well-known dataset for interpreting the inference process of LMs for fact completion misses important distinctions in this taxonomy. With this in mind, we propose a model-specific recipe for constructing precise testing data, which we call PrepMech. We use these data to investigate the sensitivity of a popular interpretability method, causal tracing (CT), to different prediction mechanisms. We find that while CT produces different results for different mechanisms, aggregated results are representative only of the mechanism with the strongest signal. In summary, we contribute tools for a more granular study of fact completion in language models, together with analyses that provide a more nuanced understanding of the underlying mechanisms.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: data artifacts, knowledge tracing, probing, robustness
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 1566