TL;DR: We propose statistical tests to determine whether two open-weight language models were trained independently of each other or not, i.e., whether one was fine-tuned from the other.
Abstract: Motivated by liability and intellectual property concerns over open-weight models, we consider the following problem: given the weights of two models, can we test whether they were trained independently---i.e., from independent random initializations? We consider two settings: *constrained* and *unconstrained*. In the constrained setting, we make assumptions about model architecture and training and propose statistical tests that yield exact p-values under the null hypothesis that the models were trained from independent random initializations. We compute the p-values by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures between the original two models with those between the exchangeable copies. We report p-values for all pairs of 21 open-weight models (210 pairs in total) and find that we correctly identify every pair of non-independent models. In the unconstrained setting, we make none of these assumptions and allow for adversarial evasion attacks that do not change model output. We therefore propose a new test that matches hidden activations between the two models; it is robust to these transformations and to changes in model architecture, and it can also identify specific non-independent components of models. Though this test no longer yields exact p-values, empirically we find that it reliably distinguishes non-independent models, much like a p-value. Notably, we can use the test to identify specific parts of one model that are derived from another (e.g., how Llama 3.1-8B was pruned to initialize Llama 3.2-3B, or shared layers between Mistral-7B and StripedHyena-7B), and it is even robust to retraining individual layers of either model from scratch.
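To make the constrained-setting procedure concrete, below is a minimal sketch of a permutation-based independence test. It assumes the two models share an architecture and that permuting the hidden units of one model yields an exchangeable copy under the null of independent initialization; the cosine-similarity statistic, single-layer setup, and function names are illustrative choices, not necessarily those used in the paper or the released code.

```python
# Minimal sketch: exact permutation p-value for the null hypothesis that two
# weight matrices come from independently initialized models.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened weight matrices."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def permutation_p_value(w1: np.ndarray, w2: np.ndarray,
                        n_perm: int = 99, seed: int = 0) -> float:
    """Rank-based p-value: compare the observed similarity of (w1, w2) against
    similarities to hidden-unit (row) permutations of w2, which are exchangeable
    copies under the independence null."""
    rng = np.random.default_rng(seed)
    observed = cosine_similarity(w1, w2)
    null_sims = []
    for _ in range(n_perm):
        perm = rng.permutation(w2.shape[0])        # shuffle hidden units
        null_sims.append(cosine_similarity(w1, w2[perm]))
    # (1 + #{null >= observed}) / (1 + n_perm) is a valid p-value under the null
    return (1 + sum(s >= observed for s in null_sims)) / (1 + n_perm)

# Toy usage with random "weights" (illustrative only):
w_a = np.random.default_rng(1).standard_normal((1024, 1024))
w_b = w_a + 0.01 * np.random.default_rng(2).standard_normal((1024, 1024))  # "fine-tuned" copy
w_c = np.random.default_rng(3).standard_normal((1024, 1024))               # independent model
print(permutation_p_value(w_a, w_b))  # small (0.01 with 99 permutations): not independent
print(permutation_p_value(w_a, w_c))  # typically not small: consistent with independence
```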
Lay Summary: As large language models become increasingly common, it is important to determine when one model has been adapted from another through a practice called "fine-tuning", a cost-effective and popular method for specializing models in domains such as coding or law. However, this raises intellectual property issues and concerns about potential misuse, since the costs of developing models are increasing and come with additional legal strings attached. To address this, we developed statistical tests that detect when one model is derived from another. Our approach works by comparing the internal parameters (weights) of two models. We measure how similar the two models' weights are, compared to permuted versions of the same weights, which serve as a baseline for how similar independent models would look. If the models are indeed related, their original weights will be significantly more similar than expected by chance, allowing us to reliably test for derivative relationships. Our tests accurately identified all derivative relationships among 210 pairs of open-weight models, each with 7 billion parameters. Moreover, our methods proved robust across models of various sizes and types of modifications.
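The unconstrained-setting test described in the abstract compares hidden activations rather than raw weights, which keeps it robust to output-preserving transformations and architecture changes. The sketch below illustrates that idea under simplifying assumptions: activations from one layer of each model are collected on the same tokens, hidden units are matched by correlation via a linear assignment, and the mean matched correlation serves as an informal score rather than an exact p-value. The names and the specific statistic are hypothetical, not the paper's exact implementation.

```python
# Minimal sketch: activation matching between one layer of each model.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_score(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """acts_*: (n_tokens, hidden_dim) activations from one layer of each model,
    computed on the same token sequence. Hidden units are matched one-to-one by
    absolute correlation; a high mean matched correlation suggests the layers
    are not independent."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-6)   # standardize units
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-6)
    corr = np.abs(a.T @ b) / a.shape[0]                      # (d_a, d_b) |correlation|
    rows, cols = linear_sum_assignment(-corr)                # maximize total correlation
    return float(corr[rows, cols].mean())
```

Comparing such per-layer scores against scores between known-independent models is one way to flag specific non-independent components, in the spirit of the Llama 3.1-8B / Llama 3.2-3B example in the abstract.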
Link To Code: https://github.com/ahmeda14960/model-tracing
Primary Area: Deep Learning->Large Language Models
Keywords: language models, finetuning, fingerprinting
Submission Number: 9129