Certifying LLM Safety against Adversarial Prompting

Aounon Kumar; Chirag Agarwal; Suraj Srinivas; Aaron Jiaxun Li; Soheil Feizi; Himabindu Lakkaraju

Certifying LLM Safety against Adversarial Prompting

Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju

Published: 10 Jul 2024, Last Modified: 09 Sept 2024COLMEveryoneRevisionsBibTeXCC BY 4.0

Research Area: Safety

Keywords: Certified Defense, Adversarial Attacks, Safety

TL;DR: We introduce Erase-and-Check, the first framework designed to defend against adversarial prompts with certifiable safety guarantees.

Abstract: Large language models (LLMs) are vulnerable to adversarial attacks, which add maliciously designed token sequences to bypass the model’s safety guardrails and cause it to produce harmful content. In this work, we introduce erase-and-check, the first framework to defend against adversarial prompts with certifiable safety guarantees. Given a prompt, our erase-and-check method erases tokens individually and inspects the resulting subsequences using a safety filter, declaring it harmful if any of the subsequences are detected as harmful. Our safety filters are implemented by leveraging Llama 2 and DistilBERT. We theoretically demonstrate that our method detects harmful prompts with accuracy at least as high as the safety filter. Additionally, we propose three efficient empirical defenses inspired by our erase-and-check (EC) method: i) RandEC, a randomized subsampling version of erase-and-check; ii) GreedyEC, which greedily erases tokens that maximize the softmax score of the harmful class; and iii) GradEC, which uses gradient information to optimize the tokens to erase. Extensive empirical evaluation with real-world datasets demonstrates the effectiveness of the proposed methods in defending against state-of-the-art adversarial prompting attacks.

Supplementary Material: zip

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 557

Loading