LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

Published: 09 Oct 2024 · Last Modified: 03 Jan 2025 · Red Teaming GenAI Workshop @ NeurIPS'24 (Oral) · CC BY 4.0
Keywords: language models, ai security, ai safety, robustness, adversarial attacks
TL;DR: Current LLM defenses, which are remarkably robust against automated adversarial attacks, are not robust against humans who attack over multiple turns -- a more realistic threat model of malicious use in the real world.
Abstract: Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even when adversarially attacked. However, LLM defenses are primarily evaluated against automated adversarial attacks in a single turn of conversation, an insufficient threat model for real-world malicious use. We demonstrate that multi-turn human jailbreaks uncover significant vulnerabilities, exceeding 70% attack success rate (ASR) on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Human jailbreaks also reveal vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into Multi-Turn Human Jailbreaks (MHJ), a dataset of 2,912 prompts across 537 multi-turn jailbreaks. We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
Submission Number: 5