Protoknowledge Shapes Behaviour of LLMs in Downstream Tasks: Memorization and Generalization with Knowledge Graphs

ACL ARR 2026 January Submission 4068 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Memorization, Generalization, Knowledge Graphs, Text-To-SPARQL
Abstract: We introduce the concept of protoknowledge to formalize and measure how Knowledge Graphs (KGs) are internalized during pretraining and reused at inference by Large Language Models (LLMs). LLMs are known to memorize vast amounts of token sequences, and a central open question is how this memorization can serve as reusable knowledge through implicit abstraction and generalization. We categorize protoknowledge into lexical, hierarchical, and topological forms, reflecting different levels of abstraction over KGs. We measure these forms through Knowledge Activation Tasks (KATs), analyzing general properties such as semantic bias. We then examine how protoknowledge affects Text-to-SPARQL, a task requiring conformity to the target KG’s formal structure. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. We do not frame this phenomenon as data contamination alone; rather, protoknowledge provides a measurable signal of how LLMs internalize structured information during pretraining and reuse it in downstream tasks. This perspective offers a more nuanced view of semantic-level data contamination and supplies an effective strategy for interpreting the behaviour of Closed-Pretraining models.
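To make the Text-to-SPARQL setting concrete, the sketch below (a hypothetical Python illustration, not code from the submission) shows why conformity to the target KG's formal structure depends on lexical protoknowledge: a generated query can only be correct if the model already maps surface labels such as "Inception" and "director" to the KG's internal identifiers (here Wikidata's Q25188 and P57). The gold query, the identifier check, and the function name are assumptions for illustration only, not the paper's actual KATs or evaluation code.

# Hypothetical sketch: checking whether a predicted SPARQL query reuses the
# gold Wikidata identifiers (lexical protoknowledge), independently of syntax.

GOLD_QUERY = """
SELECT ?director WHERE {
  wd:Q25188 wdt:P57 ?director .   # Q25188 = "Inception", P57 = "director"
}
"""

def uses_gold_identifiers(predicted_sparql: str, gold_ids=("Q25188", "P57")) -> bool:
    # A prediction counts as lexically aligned only if it contains every gold KG identifier.
    return all(gid in predicted_sparql for gid in gold_ids)

# A syntactically valid prediction with a wrong entity identifier fails the check.
prediction = "SELECT ?d WHERE { wd:Q25189 wdt:P57 ?d . }"
print(uses_gold_identifiers(prediction))  # False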
Paper Type: Long
Research Area: Generalizability and Transfer
Research Area Keywords: Knowledge Graphs, Memorization, Generalization, SPARQL
Languages Studied: English
Submission Number: 4068