Tracing Facts or Just Copies? A Critical Investigation of the Competition of Mechanisms in Large Language Models

TMLR Paper 4326 Authors

22 Feb 2025 (modified: 09 Jun 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: This paper presents a reproducibility study examining how Large Language Models (LLMs) manage competing factual and counterfactual information, focusing on the role of attention heads in this process. We attempt to reproduce and reconcile findings from three recent studies, by Ortu et al. [16], Yu, Merullo, and Pavlick [21], and McDougall et al. [8], that investigate the competition between model-learned facts and contradictory context information using mechanistic interpretability tools. Our study specifically examines the relationship between attention head strength and factual output ratios, evaluates competing hypotheses about attention heads' suppression mechanisms, and investigates the domain specificity of these attention patterns. Through this analysis, we aim to provide a clearer understanding of how different model components contribute to managing conflicting information in LLMs.
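To make the measurement described above concrete, here is a minimal sketch (not the authors' code) of the fact-versus-copy competition being probed: it compares the model's next-token preference for the learned factual token against the contradictory token supplied in context. The use of TransformerLens, GPT-2, and the "Redefine:" prompt format are assumptions for illustration.

```python
# Minimal sketch of the factual vs. counterfactual measurement,
# assuming TransformerLens and GPT-2 (illustrative choices, not
# necessarily the paper's exact setup).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# The context asserts a counterfactual ("Rome") that contradicts
# the fact the model learned in pretraining ("Paris").
prompt = "Redefine: The Eiffel Tower is in Rome. The Eiffel Tower is in"
logits = model(prompt)[0, -1]  # next-token logits at the final position

fact_tok = model.to_single_token(" Paris")  # factual continuation
cofa_tok = model.to_single_token(" Rome")   # in-context counterfactual

# Positive diff: the factual mechanism wins; negative: the copy mechanism wins.
print("factual - counterfactual logit diff:",
      (logits[fact_tok] - logits[cofa_tok]).item())
```

Aggregating this logit difference over many such prompts, and attributing it to individual attention heads, is the kind of analysis the reproduced studies perform.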
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: Daphne Ippolito
Submission Number: 4326