Keywords: data poisoning, language model, ai security, dataset ownership verification, training data membership, privacy, copyright
TL;DR: We show that indirect data poisoning against language model pre-training is possible and how it can be used to detect whether a model was trained on a protected dataset.
Abstract: The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins.
Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on *regurgitation* of training data, which LM providers try to limit.
In this work, we demonstrate that *indirect data poisoning* (where the targeted behavior is absent from training data) is not only feasible against LLMs but also makes it possible to effectively protect a dataset and trace its use.
Using gradient-based optimization (prompt-tuning), we craft poisons that make a model learn arbitrary *secret sequences*: secret responses to secret prompts that are **absent from the training corpus**.
We validate our approach on language models pre-trained from scratch and show that less than 0.005\% of poisoned tokens is sufficient to covertly make an LM learn a *secret* and detect it with extremely high confidence ($p < 10^{-55}$) using a theoretically certifiable scheme.
Crucially, this occurs without performance degradation (on LM benchmarks) and despite secrets **never appearing in the training set**.
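The abstract does not spell out the detection statistic, but a certifiable $p$-value of this kind typically follows from a simple null model. Below is a minimal sketch, assuming the secret response tokens were drawn uniformly at random from the vocabulary and detection counts how many of them the suspect model reproduces by greedy decoding; the function name and example numbers are illustrative, not taken from the paper.

```python
import math

def binomial_tail_p_value(matches: int, total: int, vocab_size: int) -> float:
    """P-value under the null hypothesis that the model never saw the secret:
    each secret token is uniform over the vocabulary, so a greedy prediction
    matches it with probability 1/vocab_size, independently per position.
    Returns P[X >= matches] for X ~ Binomial(total, 1/vocab_size)."""
    p = 1.0 / vocab_size
    return sum(
        math.comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(matches, total + 1)
    )

# Hypothetical example: 16 of 20 secret-response tokens reproduced verbatim
# by greedy decoding, with a 50k-token vocabulary.
print(binomial_tail_p_value(16, 20, 50_000))  # ~3e-72, far below 1e-55
```

Because the null distribution is known exactly, the resulting bound holds without any assumption on the model itself, which is what makes such a scheme certifiable.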
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21116