Keywords: data poisoning, language model, ai security, dataset ownership verification, training data membership, privacy, copyright
TL;DR: We show that indirect data poisoning against language model pre-training is possible and how it can be used to detect whether a model was trained on a protected dataset.
Abstract: The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins.
Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on *regurgitation* of training data, which LM providers try to limit.
In this work, we demonstrate that *indirect data poisoning* (where the targeted behavior is absent from training data) is not only feasible against LLMs but also makes it possible to effectively protect a dataset and trace its use.
Using gradient-based optimization (prompt-tuning), we craft poisons that make a model learn arbitrary *secret sequences*: secret responses to secret prompts that are **absent from the training corpus**.
We validate our approach on language models pre-trained from scratch and show that less than 0.005\% of poisoned tokens is sufficient to covertly make an LM learn a *secret* and detect it with extremely high confidence ($p < 10^{-55}$) using a theoretically certifiable scheme.
Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets **never appearing in the training set**.
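The abstract does not spell out the detection statistic, but a certifiable $p$-value of this kind typically follows from a simple null model. Below is a minimal sketch, assuming the secret response tokens were drawn uniformly at random from the vocabulary and detection counts how many of them the suspect model reproduces by greedy decoding; the function name and example numbers are illustrative, not taken from the paper.

```python
import math

def binomial_tail_p_value(matches: int, total: int, vocab_size: int) -> float:
    """P-value under the null hypothesis that the model never saw the secret:
    each secret token is uniform over the vocabulary, so a greedy prediction
    matches it with probability 1/vocab_size, independently per position.
    Returns P[X >= matches] for X ~ Binomial(total, 1/vocab_size)."""
    p = 1.0 / vocab_size
    return sum(
        math.comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(matches, total + 1)
    )

# Hypothetical example: 16 of 20 secret-response tokens reproduced verbatim
# by greedy decoding, with a 50k-token vocabulary.
print(binomial_tail_p_value(16, 20, 50_000))  # ~3e-72, far below 1e-55
```

Because the null distribution is known exactly, the resulting bound holds without any assumption on the model itself, which is what makes such a scheme certifiable.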
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21116