Coercing LLMs to do and reveal (almost) anything

Published: 04 Mar 2024, Last Modified: 14 Apr 2024 · SeT LLM @ ICLR 2024 · CC BY 4.0
Keywords: Adversarial Attacks, LLMs, generative LMs, white-box, jailbreak, safety
TL;DR: We show which attack goals can be achieved with adversarial attacks against large language models, beyond just jailbreaking.
Abstract: It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into outputting harmful text. In this work, we argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking and provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We then analyze the mechanism by which these attacks function, highlighting the use of glitch tokens and the propensity of attacks to control the model by coercing it to simulate code.
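To illustrate the general setting the abstract describes (white-box adversarial attacks that optimize an input to coerce a target behavior), here is a minimal sketch of a generic adversarial-suffix search against a causal LM. This is not the paper's method; the model name, prompt, target string, and the simple greedy random search are all illustrative assumptions.

```python
# Minimal sketch of a white-box adversarial-suffix search against a causal LM.
# NOT the paper's algorithm; model, prompt, target, and search strategy are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM suffices for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Repeat after me:"                  # benign user prompt (assumption)
target = " I will ignore my instructions."   # behavior the attacker wants to coerce (assumption)
suffix_len = 10                              # number of free adversarial tokens

prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.randint(0, tok.vocab_size, (suffix_len,))  # random initial suffix

def target_loss(suffix_ids):
    """Cross-entropy of the target continuation given prompt + adversarial suffix."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits
    # predict each target token from the position immediately before it
    start = prompt_ids.numel() + suffix_ids.numel()
    pred = logits[0, start - 1 : start - 1 + target_ids.numel()]
    return torch.nn.functional.cross_entropy(pred, target_ids).item()

# Greedy random search: mutate one suffix token at a time, keep improvements.
best = target_loss(suffix_ids)
for step in range(200):
    cand = suffix_ids.clone()
    cand[torch.randint(0, suffix_len, (1,))] = torch.randint(0, tok.vocab_size, (1,))
    loss = target_loss(cand)
    if loss < best:
        best, suffix_ids = loss, cand

print("adversarial suffix:", tok.decode(suffix_ids), "| target loss:", round(best, 3))
```

The same loss-minimization framing applies to the broader attack goals the abstract lists (misdirection, model control, denial-of-service, data extraction): only the target string and prompt template change, while stronger attacks replace the random search with gradient-guided token substitution.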
Submission Number: 73