Keywords: pretraining data, pretraining, linear, linear features, interpretability, linear representations, corpus frequency
TL;DR: Language Models (LMs) form linear representations of relations when the average subject-object frequency surpasses a certain threshold
Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we only understand the most basic principles of this relationship. While most work focuses on pretraining data's effect on downstream task behavior, we investigate its relationship to LM representations. Previous work has discovered that, in language models, some concepts are encoded as ``linear representations'', but what factors cause these representations to form (or not)? We study the connection between differences in pretraining data frequency and differences in trained models' linear representations of factual recall relations. We find evidence that the two are linked, with the formation of linear representations strongly connected to pretraining term frequencies. First, we establish that the presence of linear representations for subject-relation-object (s-r-o) fact triplets is highly correlated with both subject-object co-occurrence frequency and in-context learning accuracy. This is the case across all phases of pretraining, i.e., it is not affected by the model's underlying capability. In OLMo 7B and GPT-J (6B), we discover that a linear representation consistently (but not exclusively) forms when the subjects and objects within a relation co-occur at least 1-2k times, regardless of when these occurrences happen during pretraining. In the OLMo 1B model, consistent linearity only occurs after 4.4k occurrences, suggesting a connection to scale. Finally, we train a regression model on measurements of linear representation quality that can predict how often a term was seen in pretraining. We show such model achieves low error even for a different model and pretraining dataset, providing a new unsupervised method for exploring possible data sources of closed-source models. We conclude that the presence or absence of linear representations in LMs contains signal about their pretraining corpora that may provide new avenues for controlling and improving model behavior. We release our code to support future work.
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12604
Loading