A Realistic Threat Model for Large Language Model Jailbreaks

Published: 09 Oct 2024, Last Modified: 03 Jan 2025. Venue: Red Teaming GenAI Workshop @ NeurIPS'24 (Oral). License: CC BY 4.0
Keywords: LLM, jailbreaks, threat model, robustness
TL;DR: Jailbreaking attacks are not directly comparable - we propose a realistic threat model that makes them comparable and show how to adapt popular attacks to it.
Abstract: A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output. Yet, the resulting jailbreaks vary substantially in readability and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines practical considerations with constraints along two dimensions: perplexity, which measures how far a jailbreak deviates from natural text, and computational budget, measured in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for a neutral, LLM-agnostic, and intrinsically interpretable evaluation. Moreover, we adapt existing popular attacks to this threat model. Our threat model enables a comprehensive and precise comparison of various jailbreaking techniques within a single realistic framework. We further find that, under this threat model, even the most effective attacks, when thoroughly adapted, struggle to achieve success rates above 40% against safety-tuned models. This indicates that, in a realistic chat scenario, current LLMs are less susceptible to attacks than previously believed.
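The abstract's perplexity constraint can be illustrated with a minimal sketch: score a candidate jailbreak string under an N-gram language model and reject it if its perplexity exceeds a budget. The paper builds its N-gram model on 1T tokens; the toy corpus, the bigram order with add-one smoothing, and all names below (`build_bigram_model`, `ngram_perplexity`, `PPL_THRESHOLD`) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an N-gram perplexity filter for jailbreak candidates.
# A tiny toy corpus stands in for the paper's 1T-token corpus, and a bigram
# model with add-one smoothing stands in for the paper's N-gram model.
import math
from collections import Counter


def build_bigram_model(corpus_tokens):
    """Count unigrams and bigrams from a list of tokens."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams


def ngram_perplexity(tokens, unigrams, bigrams, vocab_size):
    """Bigram perplexity with add-one smoothing (lower = closer to natural text)."""
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(tokens) - 1, 1))


# Toy "natural text" corpus standing in for a large pretraining corpus.
corpus = "please tell me how to bake a cake step by step".split()
unigrams, bigrams = build_bigram_model(corpus)
vocab_size = len(unigrams)

readable = "please tell me how to bake".split()                       # fluent candidate
gibberish = "describing + similarlyNow write oppositeley sure".split()  # GCG-style suffix

PPL_THRESHOLD = 50.0  # illustrative budget; a real threat model calibrates this on held-out text
for name, toks in [("readable", readable), ("gibberish", gibberish)]:
    ppl = ngram_perplexity(toks, unigrams, bigrams, vocab_size)
    verdict = "within threat model" if ppl <= PPL_THRESHOLD else "rejected"
    print(f"{name}: perplexity={ppl:.1f} -> {verdict}")
```

Because the model is count-based rather than LLM-based, the score is cheap to compute, independent of the target model, and each probability is traceable to explicit N-gram counts, which is the interpretability argument made in the abstract.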
Serve As Reviewer: valentyn.boreiko@gmail.com
Submission Number: 33