Keywords: Jailbreak, SafeGenAi, DPO, RLHF, KTO, IPO, ORPO
TL;DR: We jailbreak DPO, IPO, KTO and ORPO with a backdoor.
Abstract: Aligning large language models is essential to obtain models that generate help-
ful and harmless responses. However, it has been shown that these models are
prone to jailbreaking attacks by reverting them to their unaligned state via adver-
sarial prompt engineering or poisoning of the alignment process. Prior work has
introduced a ”universal jailbreak backdoor” attack, in which an attacker poisons
the training data used for reinforcement learning from human feedback (RLHF).
This work further explores the universal jailbreak backdoor attack, by applying
it to other alignment techniques, namely direct preference optimization (DPO),
identity preference optimization (IPO), Kahneman-Tversky optimization (KTO)
and odds ratio preference optimization (ORPO). We compare our findings with
previous results and question the robustness of the named algorithms.
Submission Number: 218
Loading