Independence Tests for Language Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: large language models, fingerprinting, finetuning
TL;DR: We propose statistical tests to determine whether two open-weight language models were trained independently of each other or whether one was fine-tuned from the other.
Abstract: We consider the following problem of model provenance: given the weights of two language models, can a third party verify whether they were trained independently? We propose a family of statistical tests, applicable to models of any architecture, that yield exact p-values with respect to the null hypothesis that the models were trained with independent randomness (e.g., independent random initialization). These p-values are valid regardless of the composition of either model's training data; we obtain them by simulating independent copies of each model and comparing various measures of similarity in the weights and activations of the original two models to those of the independent copies. We evaluate the power of these tests on all pairs of 21 open-weight models (210 pairs in total) and find that they reliably identify all 69 pairs of fine-tuned models. Notably, our tests remain effective even after substantial fine-tuning: we accurately detect dependence between Llama 2 and Llemma, even though the latter was fine-tuned on an additional 750B tokens (37.5% of the original Llama 2 training budget). Finally, we identify transformations of model weights that break the effectiveness of our tests without altering model outputs, and, motivated by the existence of these evasion attacks, we propose a mechanism for matching hidden activations between the MLP layers of two models that is robust to these transformations. Although this mechanism no longer yields exact p-values, empirically we find that it reliably distinguishes fine-tuned models and pruned models of different dimensions, and it is robust even to completely retraining the MLP layers from scratch.
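As a rough illustration of the testing idea described in the abstract, the following is a minimal sketch, not the paper's exact procedure. It computes a Monte Carlo exact p-value by comparing a weight-similarity statistic between the two models against the same statistic computed on independently simulated copies; all function and variable names (`cosine_stat`, `simulate_independent_copy`, etc.) are hypothetical.

```python
import torch

def cosine_stat(w_a: torch.Tensor, w_b: torch.Tensor) -> float:
    """Similarity statistic between two flattened weight tensors."""
    return torch.nn.functional.cosine_similarity(
        w_a.flatten(), w_b.flatten(), dim=0
    ).abs().item()

def exact_p_value(w_a, w_b, simulate_independent_copy, n_sims=99):
    """Exact Monte Carlo p-value for the independence null.

    `simulate_independent_copy` should return weights with the same
    distribution as w_b but fresh randomness (e.g., a fresh random
    initialization), so that under the null hypothesis the observed
    statistic is exchangeable with the simulated ones.
    """
    observed = cosine_stat(w_a, w_b)
    null_stats = [cosine_stat(w_a, simulate_independent_copy())
                  for _ in range(n_sims)]
    # The rank of the observed statistic among the null draws gives a
    # finite-sample-valid p-value, regardless of the training data.
    exceed = sum(s >= observed for s in null_stats)
    return (1 + exceed) / (1 + n_sims)
```

The robust activation-matching mechanism could be sketched similarly (again a hypothetical illustration, under the assumption that the matching is computed from activations on a shared input batch, which makes it invariant to output-preserving weight transformations such as hidden-unit permutations):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_units(acts_a: np.ndarray, acts_b: np.ndarray):
    """acts_a, acts_b: (n_inputs, hidden_dim) MLP activations collected
    on the same inputs; the hidden dimensions may differ. Returns the
    one-to-one unit assignment maximizing total |correlation| and the
    mean matched |correlation|."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = a.T @ b / a.shape[0]                        # (h_a, h_b)
    rows, cols = linear_sum_assignment(-np.abs(corr))  # maximize |corr|
    return list(zip(rows, cols)), float(np.abs(corr[rows, cols]).mean())
```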
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12531