TL;DR: Pretraining creates extractive structures in LMs that enable them to generalize to implications of new facts seen during finetuning.
Abstract: Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on "John Doe lives in Tokyo," LMs correctly answer "What language do the people in John Doe's city speak?" with "Japanese". However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining.
We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when encountering implications of previously known facts. This yields two predictions: a data ordering effect where extractive structures can be learned only if facts precede their implications, and a weight grafting effect where extractive structures can be grafted to predict counterfactual implications.
We empirically show these effects in the OLMo-7B, Llama 3-8B, Gemma 2-9B, and Qwen 2-7B models.
Of independent interest, our results also indicate that fact learning can occur at both early and late layers, and that these lead to different forms of generalization.
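To make the predicted data-ordering effect concrete, below is a minimal, illustrative sketch in the spirit of, but not taken from, the released code (the weight-grafting effect is not sketched here). A model is trained on a fact and its implication in one order or the other, then taught a new fact, and finally probed on that new fact's implication. The stand-in model (`gpt2`), the example facts, the probe prompt, and the training hyperparameters are all assumptions for illustration only.

```python
# Illustrative sketch (not the authors' released code) of the predicted data-ordering
# effect: the extractive structure should be picked up only when a fact is seen before
# its implication, and that structure should then let the model generalize to the
# implication of a *new* fact learned afterward. Model, prompts, and hyperparameters
# are assumptions chosen so the script runs quickly, not the paper's actual setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"  # small stand-in; the paper uses OLMo-7B, Llama 3-8B, Gemma 2-9B, Qwen 2-7B
tok = AutoTokenizer.from_pretrained(BASE)

fact_a = "John Doe lives in Tokyo."
impl_a = "The people in the city John Doe lives in speak Japanese."
fact_b = "Jane Roe lives in Paris."                          # new fact taught at finetuning time
probe_b = "The people in the city Jane Roe lives in speak"   # implied continuation: "French"

def finetune(model, text, steps=20, lr=1e-4):
    """Plain causal-LM finetuning on a single string for a few gradient steps."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    batch = tok(text, return_tensors="pt")
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

def next_token(model, prompt):
    """Greedy next-token prediction for the probe prompt."""
    model.eval()
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    return tok.decode(logits[0, -1].argmax().item())

# Compare fact-before-implication with implication-before-fact during a simulated
# "pretraining" phase, then teach the new fact and probe its implication.
for order in [(fact_a, impl_a), (impl_a, fact_a)]:
    model = AutoModelForCausalLM.from_pretrained(BASE)
    for text in order:
        finetune(model, text)        # simulated pretraining phase
    finetune(model, fact_b)          # teach the new fact
    print(order, "->", repr(next_token(model, probe_b)))  # does it infer "French"?
```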
Lay Summary: The language models that underlie modern chat assistants seem knowledgeable and intelligent about a great many things. We are curious about how they actually learn all this information and develop these reasoning skills.
To explore this, we take pretrained language models that are already quite smart, and teach them new facts by further training on them. For example, if we teach the model that "John Doe lives in Tokyo", we are interested in whether the model can figure out on its own that "The people in the city John Doe lives in speak Japanese". If so, the model isn't just memorizing facts it sees in training; it's reasoning and making connections. Our research aims to understand how it does this: what internal mechanisms in the model allow it to reason, and how did the model learn these mechanisms?
To answer these questions, we develop the "Extractive Structures" framework, where we decompose the internals of a language model into "informative components", which store factual information, and "extractive components", which extract and process the stored information. Our work helps us pinpoint these components in a language model, and even predict how well language models learn new things based on how the components are set up. Overall, this research helps us build a clearer picture of how AI learns and reasons, an important step towards developing safe and reliable AI systems.
Link To Code: https://github.com/jiahai-feng/extractive-structures
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: interpretability, language models, generalization
Submission Number: 5880