CosyCPT: Coreness-Aware Synthetic Continued Pretraining

ICLR 2026 Conference Submission22241 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: synthetic continued pretraining, knowledge acquisition, large language models, graph mining, sampling, data augmentation
Abstract: Synthetic continued pretraining adapts LLMs to specific domains by fine-tuning them on synthetic data that augments real domain data. However, existing methods are often data-inefficient, requiring massive synthetic corpora to enumerate all relational facts, and they fail to account for the relative importance of different entity relationships. In this paper, we propose coreness-aware synthetic continued pretraining (CosyCPT), a systematic pipeline that addresses both limitations. Our method (1) constructs a graph representation of the entity relations in a document, (2) quantifies relation importance via coreness scores derived from the graph, and (3) leverages these scores to guide synthetic data sampling and augmentation for continued pretraining. We investigate four definitions of entity coreness and four formulations of relation coreness, verifying that multiple variants of coreness-aware sampling can outperform random sampling of augmented data for synthetic continued pretraining. We also offer a mathematical analysis, proving that (1) given a learning budget, maximizing the expected accuracy on a query set about relational knowledge in a document collection is an NP-complete problem, (2) coreness-aware sampling is the optimal solution when each query examines a single entity pair, and (3) coreness-aware sampling has a better upper bound on expected accuracy than random sampling.
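To make steps (1)-(3) concrete, the following is a minimal sketch of one coreness-aware sampling variant. It assumes k-core numbers (via networkx) as the entity-coreness definition and the minimum of the two endpoint corenesses as the relation coreness; the abstract states that four definitions of each are studied, so this instantiation, along with the function name `coreness_aware_sample`, is illustrative rather than the authors' implementation.

```python
# A minimal sketch of coreness-aware sampling of relations for synthetic
# continued pretraining. Assumptions (not from the paper): entity coreness is
# the k-core number of the entity in the relation graph, relation coreness is
# the smaller of the two endpoint corenesses, and sampling is done with
# replacement for simplicity. Names are illustrative.
import random
import networkx as nx

def coreness_aware_sample(relations, budget, seed=0):
    """Sample `budget` entity-pair relations, weighted by relation coreness.

    relations: iterable of (entity_a, entity_b) pairs extracted from a document
    budget:    number of relations to keep for synthetic augmentation
    """
    graph = nx.Graph()
    graph.add_edges_from(relations)        # step (1): entity-relation graph
    core = nx.core_number(graph)           # step (2): entity coreness via k-core numbers
    edges = list(graph.edges())
    # One possible relation-coreness formulation: the weaker endpoint's coreness.
    weights = [min(core[u], core[v]) for u, v in edges]
    rng = random.Random(seed)
    # step (3): coreness-weighted sampling (with replacement, for brevity).
    return rng.choices(edges, weights=weights, k=min(budget, len(edges)))

if __name__ == "__main__":
    toy_relations = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
    print(coreness_aware_sample(toy_relations, budget=3))
```

In a full pipeline, the sampled entity pairs would then feed the augmentation step described in the abstract (e.g., generating synthetic passages about each sampled relation) before continued pretraining.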
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22241