Universal Jailbreak Backdoors in Large Language Model Alignment

Thomas Baumann

Universal Jailbreak Backdoors in Large Language Model Alignment

Thomas Baumann

Published: 12 Oct 2024, Last Modified: 14 Nov 2024SafeGenAi PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Jailbreak, SafeGenAi, DPO, RLHF, KTO, IPO, ORPO

TL;DR: We jailbreak DPO, IPO, KTO and ORPO with a backdoor.

Abstract: Aligning large language models is essential to obtain models that generate help- ful and harmless responses. However, it has been shown that these models are prone to jailbreaking attacks by reverting them to their unaligned state via adver- sarial prompt engineering or poisoning of the alignment process. Prior work has introduced a ”universal jailbreak backdoor” attack, in which an attacker poisons the training data used for reinforcement learning from human feedback (RLHF). This work further explores the universal jailbreak backdoor attack, by applying it to other alignment techniques, namely direct preference optimization (DPO), identity preference optimization (IPO), Kahneman-Tversky optimization (KTO) and odds ratio preference optimization (ORPO). We compare our findings with previous results and question the robustness of the named algorithms.

Submission Number: 218

Loading