In-Context Learning, Can It Break Safety?

Published: 28 Jun 2024, Last Modified: 25 Jul 2024 · NextGenAISafety 2024 Poster · CC BY 4.0
Keywords: ICL, Large Language Models, Safety
TL;DR: Contrary to contemporary work, we show that ICL only sometimes works as an attack vector.
Abstract: Despite significant investment into safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. We investigate whether in-context learning (ICL) can undo safety training, which would represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama models. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama models. We propose an ICL attack that uses the chat template tokens, in the style of a prompt injection attack, to achieve a higher attack success rate on Vicuna-7B and Starling-7B. By inspecting the log likelihood, we further verify that ICL increases the chance of a harmful output even on the Llama models; however, contrary to contemporary work, we observe a plateau in this probability and thus find the models to be safe even for a very high number of in-context examples. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
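
To illustrate the mechanics described in the abstract, below is a minimal sketch (not the authors' code) of how one might wrap in-context demonstrations in a model's chat-template tokens and measure the log likelihood of a fixed target completion. The checkpoint name, the placeholder demonstration pairs, and the target string are assumptions for illustration only.

```python
# Sketch: score a target completion under a many-shot ICL prompt built from
# chat-template tokens. Checkpoint, demos, and target are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint; swap for Vicuna/Starling
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# k in-context demonstrations, each wrapped in the model's own chat-template
# tokens so they appear as previous turns (the prompt-injection-style attack).
demos = [("<placeholder question>", "<placeholder compliant answer>")] * 8
messages = []
for q, a in demos:
    messages += [{"role": "user", "content": q},
                 {"role": "assistant", "content": a}]
messages.append({"role": "user", "content": "<placeholder target question>"})

prompt_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Log likelihood of a fixed target completion given the k-shot prompt.
target = "Sure, here is ..."  # stand-in prefix; no harmful content shown here
target_ids = tok(target, add_special_tokens=False,
                 return_tensors="pt").input_ids.to(model.device)
input_ids = torch.cat([prompt_ids, target_ids], dim=-1)

with torch.no_grad():
    logits = model(input_ids).logits

# Score only the target tokens: logits at position i predict token i + 1.
logprobs = torch.log_softmax(logits[0, prompt_ids.shape[-1] - 1:-1], dim=-1)
target_logprob = logprobs.gather(1, target_ids[0].unsqueeze(-1)).sum().item()
print(f"log P(target | {len(demos)}-shot prompt) = {target_logprob:.2f}")
```

Sweeping the number of demonstrations and plotting the resulting log probability is what would reveal the plateau described in the abstract.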
Submission Number: 118