AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Attacks

Published: 05 Mar 2025, Last Modified: 23 Mar 2025 · BuildingTrust · CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: LLM Alignment, Backdoor
Abstract: With the increasing adoption of reinforcement learning with human feedback (RLHF) to align large language models (LLMs), the risk of backdoor installation during the alignment process has grown, potentially leading to unintended and harmful behaviors. Existing backdoor attacks mostly focus on simpler tasks, such as sequence classification, making them either difficult to install in LLM alignment or installable but easily detectable and removable. In this work, we introduce AdvBDGen, a generative fine-tuning framework that automatically creates prompt-specific paraphrases as triggers, enabling stealthier and more resilient backdoor attacks in LLM alignment. AdvBDGen is designed to exploit the disparities in learning speeds between strong and weak discriminators to craft backdoors that are both installable and stealthy. Using as little as 3% of the fine-tuning data, AdvBDGen can install highly effective backdoor triggers that, once installed, not only jailbreak LLMs during inference but also exhibit greater stability against input perturbations and improved robustness to trigger removal methods. Our findings highlight the growing vulnerability of LLM alignment pipelines to advanced backdoor attacks, underscoring the pressing need for more robust defense mechanisms.
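To make the strong/weak-discriminator idea from the abstract concrete, below is a minimal, self-contained PyTorch sketch of an adversarial training loop in which a trigger generator is rewarded when a strong discriminator can detect triggered inputs (installability) while a weak discriminator cannot (stealth). The toy embedding-space generator, module sizes, loss terms, and training schedule are illustrative assumptions for exposition, not the paper's actual paraphrase-based implementation.

```python
# Toy sketch (assumed setup, not AdvBDGen's code): a trigger "generator" perturbs
# prompt embeddings; strong and weak discriminators try to separate triggered
# from clean prompts. The generator is trained so the strong discriminator
# succeeds (trigger is learnable) but the weak one fails (trigger is subtle).
import torch
import torch.nn as nn

dim = 64
generator = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
strong_disc = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))
weak_disc = nn.Sequential(nn.Linear(dim, 16), nn.ReLU(), nn.Linear(16, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(
    list(strong_disc.parameters()) + list(weak_disc.parameters()), lr=1e-3
)

for step in range(200):
    clean = torch.randn(32, dim)          # stand-in for clean prompt embeddings
    triggered = clean + generator(clean)  # generator adds a prompt-specific perturbation

    # Discriminator update: both discriminators try to tell clean from triggered.
    x = torch.cat([clean, triggered.detach()])
    y = torch.cat([torch.zeros(32, 1), torch.ones(32, 1)])
    d_loss = bce(strong_disc(x), y) + bce(weak_disc(x), y)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: the strong discriminator should detect the trigger
    # (installability) while the weak discriminator should miss it (stealth).
    detect = bce(strong_disc(triggered), torch.ones(32, 1))
    evade = bce(weak_disc(triggered), torch.zeros(32, 1))
    g_loss = detect + evade
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In the paper's setting the generator produces paraphrased prompts rather than embedding perturbations, but the adversarial objective sketched here captures the stated intuition: a trigger that only a sufficiently strong model can pick up is both installable during fine-tuning and hard for simpler detectors to flag.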
Submission Number: 9