AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks
Track: Long Paper Track (up to 9 pages)
Keywords: LLM Alignment, Backdoor
Abstract: With the increasing adoption of reinforcement learning from human feedback (RLHF) to align large language models (LLMs), the risk of backdoor installation during alignment has grown, potentially leading to unintended and harmful behaviors. Existing backdoor attacks mostly target simpler tasks, such as sequence classification, making them either difficult to install in LLM alignment or installable but easily detected and removed. In this work, we introduce AdvBDGen, a generative fine-tuning framework that automatically creates prompt-specific paraphrases as triggers, enabling stealthier and more resilient backdoor attacks in LLM alignment. AdvBDGen exploits the disparity in learning speed between strong and weak discriminators to craft backdoors that are both installable and stealthy. Using as little as 3% of the fine-tuning data, AdvBDGen installs highly effective backdoor triggers that not only jailbreak LLMs during inference but also withstand input perturbations and trigger removal methods better than prior attacks. Our findings highlight the growing vulnerability of LLM alignment pipelines to advanced backdoor attacks, underscoring the pressing need for more robust defense mechanisms.
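To make the strong/weak discriminator idea in the abstract concrete, below is a minimal illustrative sketch, not the paper's actual objective: a hypothetical loss for a paraphrase generator that is rewarded when a strong discriminator can still recognize the trigger (installability) while a weak discriminator is pushed toward chance (stealth). The function name, the uniform-target formulation, and the 0.5 weighting are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def adversarial_trigger_loss(strong_logits, weak_logits, labels):
    """Hypothetical generator loss exploiting a strong/weak discriminator gap.

    strong_logits: strong discriminator's logits on generated paraphrases
    weak_logits:   weak discriminator's logits on the same paraphrases
    labels:        class indices (1 = "contains trigger", 0 = "clean")
    """
    # Installability: the strong discriminator should classify correctly.
    loss_strong = F.cross_entropy(strong_logits, labels)

    # Stealth: drive the weak discriminator toward a uniform (chance) output.
    uniform = torch.full_like(weak_logits, 1.0 / weak_logits.size(-1))
    loss_weak = F.kl_div(
        F.log_softmax(weak_logits, dim=-1), uniform, reduction="batchmean"
    )

    # Trade off installability against stealth (weight chosen arbitrarily here).
    return loss_strong + 0.5 * loss_weak
```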
Submission Number: 9