Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers
Keywords: jailbreak, defense, adversarial robustness, robustness, AI safety
Abstract: Defending large language models against jailbreaks so that they never engage in a broad set of forbidden behaviors is an open problem. In this paper, we study whether jailbreak defense becomes more tractable when only a very narrow set of behaviors must be forbidden. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are inadequate in this setting. In pursuit of a better defense, we develop a classifier defense tailored to our bomb setting, which outperforms existing defenses on some axes but is still ultimately broken. We conclude that jailbreak defense remains unsolved, even in a narrow domain.
Submission Number: 32