Research Area: Safety
Keywords: Certified Defense, Adversarial Attacks, Safety
TL;DR: We introduce Erase-and-Check, the first framework designed to defend against adversarial prompts with certifiable safety guarantees.
Abstract: Large language models (LLMs) are vulnerable to adversarial attacks, which add maliciously designed token sequences to bypass the model’s safety guardrails and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework to defend against adversarial prompts with certifiable safety guarantees. Given a prompt, our erase-and-check method erases tokens individually and inspects the resulting subsequences using a safety filter, declaring it harmful if any of the subsequences are detected as harmful. Our safety filters are implemented by leveraging Llama 2 and DistilBERT. We theoretically demonstrate that our method detects harmful prompts with accuracy at least as high as the safety filter. Additionally, we propose three efficient empirical defenses inspired by our erase-and-check (EC) method: i) RandEC, a randomized subsampling version of erase-and-check; ii) GreedyEC, which greedily erases tokens that maximize the softmax score of the harmful class; and iii) GradEC, which uses gradient information to optimize the tokens to erase. Extensive empirical evaluation with real-world datasets demonstrates the effectiveness of the proposed methods in defending against state-of-the-art adversarial prompting attacks.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 557
Loading