Query-Based Adversarial Prompt Generation

Jonathan Hayase; Ema Borevković; Nicholas Carlini; Florian Tramèr; Milad Nasr

Published: 25 Sept 2024, Last Modified: 06 Nov 2024

Venue: NeurIPS 2024 poster

License: CC BY 4.0

Keywords: adversarial examples, large language models, black box

TL;DR: We cause aligned models to output specific harmful strings using a black-box attack

Abstract: Recent work has shown it is possible to construct adversarial examples that cause aligned language models to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through _transferability_: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a _query-based_ attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the OpenAI and Llama Guard safety classifiers with nearly 100% probability.

Supplementary Material: zip

Primary Area: Safety in machine learning

Submission Number: 15452
