On Linear Representations and Pretraining Data Frequency in Language Models

NeurIPS 2024 Workshop ATTRIB Submission 58 Authors

Published: 30 Oct 2024, Last Modified: 14 Jan 2025, ATTRIB 2024, CC BY 4.0
Keywords: pretraining data, mechanistic interpretability, linear representations, membership inference attacks
Abstract: Pretraining data has a direct impact on the behaviors and quality of language models (LMs), but we understand only the most basic principles of this relationship. While most work focuses on how pretraining data affects downstream task behavior, we investigate its effect on LM representations. Previous work has found that, in language models, some concepts are encoded as "linear representations", which have been argued to be highly interpretable and useful for controllable generation. We study the connection between differences in pretraining data frequency and differences in trained models' linear representations of factual recall relations. We find evidence that the two are directly linked: the formation of linear representations is strongly connected to pretraining term frequencies. First, we establish that the presence of linear representations for subject-relation-object facts is highly correlated with both subject-object co-occurrence frequency and in-context learning accuracy. This holds across all phases of pretraining, i.e., it is not explained by the model's overall capability. In OLMo 7B and GPT-J (6B), we find that a linear representation forms predictably once the subjects and objects within a relation co-occur at least 1–2k times; linear representations thus appear to form as a result of repeated exposure rather than lengthy pretraining time. In OLMo 1B, these representations form only after roughly 4.4k occurrences. Finally, we train a regression model on measurements of linear representation robustness that predicts, with low error, how often a term was seen in pretraining; it generalizes to GPT-J without additional training, providing a new unsupervised method for probing the possible pretraining data sources of closed-source models. We conclude that the presence or absence of linear representations carries a weak but significant signal reflecting an imprint of the pretraining corpus across LMs.
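To make the frequency-prediction step concrete, below is a minimal sketch (not the authors' released pipeline) of the idea described in the abstract: fit a regression from linear-representation quality metrics, computed on an open model with known pretraining counts, to log co-occurrence frequency, then apply it unchanged to metrics from another model. The feature names (faithfulness, causality) and all data here are hypothetical stand-ins.

```python
# Sketch: regress log pretraining co-occurrence counts on
# linear-representation quality metrics, then reuse the fitted
# regressor on a second model's metrics without refitting.
# All arrays below are synthetic placeholders, not real measurements.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Hypothetical training data from an open model (e.g., OLMo 7B):
# each row is one relation/term with two LRE-style metrics,
# target is log10 subject-object co-occurrence count in the corpus.
X_train = rng.random((500, 2))  # columns: [faithfulness, causality]
y_train = 1.0 + 4.0 * X_train @ np.array([0.6, 0.4]) + rng.normal(0, 0.3, 500)

reg = Ridge(alpha=1.0).fit(X_train, y_train)

# Apply to metrics computed on a different model (e.g., GPT-J),
# mirroring the cross-model generalization claim in the abstract.
X_other = rng.random((100, 2))
y_other_true = 1.0 + 4.0 * X_other @ np.array([0.6, 0.4]) + rng.normal(0, 0.3, 100)
y_other_pred = reg.predict(X_other)

print("MAE in log10 counts:", mean_absolute_error(y_other_true, y_other_pred))
```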
Submission Number: 58