WINTER SOLDIER: HYPNOTIZING LANGUAGE MODELS AT PRE-TRAINING WITH INDIRECT DATA POISONING

Published: 06 Mar 2025, Last Modified: 16 Apr 2025
Venue: WMARK@ICLR2025
License: CC BY 4.0
Track: tiny / short paper (3-5 pages)
Keywords: data poisoning, dataset watermarking, dataset ownership verification
TL;DR: We show how data poisoning can be used to detect if a model was trained on a protected dataset.
Abstract: The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. While membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of the training data, which LM providers try to limit. We instead suggest performing an indirect data poisoning (where the targeted behavior is hidden) to protect a dataset before sharing it. Using gradient-based prompt-tuning, we make a model learn arbitrary *secret sequences*: secret responses to secret prompts that are **absent from the training corpus**. We demonstrate our approach on language models pre-trained from scratch and show that less than $0.005\%$ of poisoned tokens is sufficient to covertly make an LM learn a secret and to detect it with a theoretically certifiable $p$-value as low as $10^{-55}$. All of this comes without performance degradation (as measured on LM benchmarks) and despite the secrets **never appearing in the training set**.
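The detection step described in the abstract (querying a suspect model with the secret prompt and obtaining a certifiable $p$-value) can be illustrated with a minimal sketch. This is not the paper's exact procedure: it assumes a rank-based per-token test combined via Fisher's method, and the model path, secret prompt, and secret response below are placeholders.

```python
# Hypothetical sketch: test whether a suspect LM has learned the secret sequence.
# Assumption: under the null hypothesis (the model was not trained on the
# protected dataset), the rank of each secret token among the vocabulary is
# roughly uniform, so rank / vocab_size acts as a per-token p-value.
import math

import torch
from scipy.stats import chi2
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/suspect-model"  # placeholder: model under audit
SECRET_PROMPT = "<secret prompt chosen by the dataset owner>"        # placeholder
SECRET_RESPONSE = "<secret response the poisoning should implant>"   # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

prompt_ids = tokenizer(SECRET_PROMPT, return_tensors="pt").input_ids
secret_ids = tokenizer(
    SECRET_RESPONSE, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    input_ids = torch.cat([prompt_ids, secret_ids], dim=1)
    logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

vocab_size = logits.shape[-1]
p_values = []
for i in range(secret_ids.shape[1]):
    # Logits predicting the i-th secret token sit at the position just before it.
    step_logits = logits[0, prompt_ids.shape[1] + i - 1]
    token_id = secret_ids[0, i].item()
    # Rank of the secret token in the vocabulary (1 = most likely token).
    rank = int((step_logits > step_logits[token_id]).sum().item()) + 1
    p_values.append(rank / vocab_size)

# Fisher's method: combine the per-token p-values into one global p-value.
statistic = -2.0 * sum(math.log(p) for p in p_values)
global_p = chi2.sf(statistic, df=2 * len(p_values))
print(f"per-token p-values: {p_values}")
print(f"combined p-value: {global_p:.3e}")
```

A very small combined $p$-value indicates the model assigns implausibly high ranks to the secret response given the secret prompt, which is evidence that it was trained on the protected (poisoned) dataset.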
Presenter: ~Wassim_Bouaziz1
Format: Yes, the presenting author will definitely attend in person because they are attending ICLR for other complementary reasons.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 61