Approximating Language Model Training Data from Weights

John Xavier Morris; Junjie Oscar Yin; Woojeong Kim; Vitaly Shmatikov; Alexander M Rush

Approximating Language Model Training Data from Weights

John Xavier Morris, Junjie Oscar Yin, Woojeong Kim, Vitaly Shmatikov, Alexander M Rush

Published: 08 Jul 2025, Last Modified: 26 Aug 2025COLM 2025EveryoneRevisionsBibTeXCC BY 4.0

Keywords: inversion, training data reconstruction

TL;DR: we recover suitable training data given only model weights

Abstract: Modern language models often have open weights but closed training data. We formalize the problem of data recovery from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering data given only weights of the original and finetuned models. The training subset pinpointed by our method in a large corpus can be used to train another model to comparable performance. Even when none of the true training data is available, data selected by our method from publicly available Web documents can be used to train a competent model.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 1185

Loading