Keywords: Jailbreaking, Adversarial Robustness, Security, Adaptive Attacks
TL;DR: We test existing defenses through the Purple Problem showing that adaptive attacks and scaling compute is important for jailbreaking and provide guidelines for best practices in preventing a false sense of security.
Abstract: The rise of ''jailbreak'' attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. Nonetheless, most benchmarks remain to be solved, not to mention real-world safety problems. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as fine-tuning or input preprocessing. To understand whether we fail because of definition or enforcement, we consider a simple and well-specified definition of unsafe outputs---outputs that contain the word ''purple''. Surprisingly, all existing fine-tuning and input defenses fail to enforce this definition under adaptive attacks and increasing compute, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We hope that this definition serves as a testbed to evaluate enforcement algorithms and prevent a false sense of security.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13112
Loading