Keywords: data poisoning, AI security, language model safety
TL;DR: We show that data-poisoning attacks on LLM pretraining and fine-tuning require a near-constant number of poison samples regardless of clean dataset size or model size.
Abstract: Poisoning attacks can compromise the safety of large language models (LLMs)
by injecting malicious documents into their training data. Existing work has
studied pretraining poisoning assuming adversaries control a *percentage* of the
training corpus. However, for large models, even small percentages translate to
impractically large amounts of data. This work demonstrates for the first time that
poisoning attacks instead require a *near-constant number of documents regardless
of dataset size*. We conduct the largest pretraining poisoning experiments to date,
pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets
(6B to 260B tokens). We find that 250 poisoned documents similarly compromise
models across all model and dataset sizes, despite the largest models training
on more than 20 times more clean data. We also run smaller-scale experiments
to ablate factors that could influence attack success, including broader ratios of
poisoned to clean data and non-random distributions of poisoned samples. Finally,
we demonstrate the same dynamics for poisoning during fine-tuning. Altogether,
our results suggest that injecting backdoors through data poisoning may be easier
for large models than previously believed, as the number of poisoned samples required does
not scale up with model size. This highlights the need for more research on defences
to mitigate this risk in future models.
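To make the scaling claim concrete, here is a back-of-the-envelope sketch (not from the paper) of how a fixed budget of 250 poisoned documents shrinks as a *fraction* of a Chinchilla-optimal pretraining corpus. The token budgets are the ones quoted in the abstract; the average clean-document length is an assumed illustrative value, not a number reported by the authors.

```python
# Illustrative only: a fixed count of poisoned documents becomes a vanishingly
# small percentage of the corpus as the clean dataset grows.

POISON_DOCS = 250        # fixed number of poisoned documents (from the abstract)
AVG_DOC_TOKENS = 500     # assumption: average tokens per clean document

# Chinchilla-optimal token budgets quoted in the abstract.
token_budgets = {
    "600M-param model": 6e9,
    "13B-param model": 260e9,
}

for model, tokens in token_budgets.items():
    clean_docs = tokens / AVG_DOC_TOKENS
    fraction = POISON_DOCS / clean_docs
    print(f"{model}: ~{clean_docs:.2e} clean docs, "
          f"poison fraction ~{fraction:.2e} ({fraction * 100:.6f}%)")
```

Under these assumptions the poison fraction drops by more than an order of magnitude from the smallest to the largest corpus, while the absolute number of poisoned documents stays at 250.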
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19208