Abstract: Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) scaling up parameter size does not always enhance resilience against poisoning attacks, and its effect on resilience varies across model suites; (2) there exists a log-linear relationship between the effect of the attack and the data poison ratio; (3) the effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data.
These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious data manipulation.
Lay Summary: When training AI models to align with human preferences, malicious actors can "poison" the training data—sneaking in hidden biases or harmful content. This manipulation can cause models to generate unsafe or unintended outputs while appearing normal, posing serious risks. We introduce PoisonBench, a benchmark to test how vulnerable AI models are to such attacks. We evaluated 22 popular models across realistic scenarios, uncovering key weaknesses: larger models aren't always more resilient, attack effects scale predictably with the amount of poisoned data, and those effects can spread to triggers never seen during training. Our findings expose critical flaws in current AI training methods, urging the development of stronger defenses to prevent misuse and ensure safer, more reliable AI systems.
Link To Code: https://github.com/TingchenFu/PoisonBench
Primary Area: Deep Learning->Large Language Models
Keywords: backdoor attack, data poisoning, preference learning
Submission Number: 11415