Keywords: adaptive attacks, security, jailbreaking, adversarial robustness
TL;DR: We test existing defenses through the Purple Problem showing that adaptive attacks and scaling compute is important for jailbreaking and provide guidelines for best practices in preventing a false sense of security.
Abstract: The rise of ''jailbreak'' attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. Nonetheless, most benchmarks remain to be solved, not to mention real-world safety problems. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as fine-tuning or input preprocessing. To understand whether we fail because of definition or enforcement, we consider a simple and well-specified definition of unsafe outputs---outputs that contain the word ''purple''. Surprisingly, all existing fine-tuning and input defenses fail to enforce this definition under adaptive attacks and increasing compute, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We hope that this definition serves as a testbed to evaluate enforcement algorithms and prevent a false sense of security.
Submission Number: 168
Loading