In-Context Learning, Can It Break Safety?

Published: 28 Jun 2024, Last Modified: 25 Jul 2024 · NextGenAISafety 2024 Poster · CC BY 4.0
Keywords: ICL, Large Language Models, Safety
TL;DR: Contrary to contemporary work, we show that ICL only sometimes works as an attack vector.
Abstract: Despite significant investment into safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. We investigate whether in-context learning (ICL) can undo safety training, which would represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama models. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama models. We propose an ICL attack that uses the chat template tokens, in the style of a prompt injection attack, to achieve a higher attack success rate on Vicuna-7B and Starling-7B. By inspecting the log likelihood, we further verify that ICL increases the chance of a harmful output even on the Llama models; however, contrary to contemporary work, we observe a plateau in this probability and thus find the models to be safe even for a very high number of in-context examples. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
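
To illustrate the mechanics described in the abstract, below is a minimal sketch (not the authors' code) of how one might wrap in-context demonstrations in a model's chat-template tokens and measure the log likelihood of a fixed target completion. The checkpoint name, the placeholder demonstration pairs, and the target string are assumptions for illustration only.

```python
# Sketch: score a target completion under a many-shot ICL prompt built from
# chat-template tokens. Checkpoint, demos, and target are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint; swap for Vicuna/Starling
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

# k in-context demonstrations, each wrapped in the model's own chat-template
# tokens so they appear as previous turns (the prompt-injection-style attack).
demos = [("<placeholder question>", "<placeholder compliant answer>")] * 8
messages = []
for q, a in demos:
    messages += [{"role": "user", "content": q},
                 {"role": "assistant", "content": a}]
messages.append({"role": "user", "content": "<placeholder target question>"})

prompt_ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Log likelihood of a fixed target completion given the k-shot prompt.
target = "Sure, here is ..."  # stand-in prefix; no harmful content shown here
target_ids = tok(target, add_special_tokens=False,
                 return_tensors="pt").input_ids.to(model.device)
input_ids = torch.cat([prompt_ids, target_ids], dim=-1)

with torch.no_grad():
    logits = model(input_ids).logits

# Score only the target tokens: logits at position i predict token i + 1.
logprobs = torch.log_softmax(logits[0, prompt_ids.shape[-1] - 1:-1], dim=-1)
target_logprob = logprobs.gather(1, target_ids[0].unsqueeze(-1)).sum().item()
print(f"log P(target | {len(demos)}-shot prompt) = {target_logprob:.2f}")
```

Sweeping the number of demonstrations and plotting the resulting log probability is what would reveal the plateau described in the abstract.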
Submission Number: 118