Keywords: LLM, jailbreak, benchmark
Abstract: The rapid adoption of large language models (LLMs) in high-stakes domains like healthcare and legal reasoning has intensified concerns about their security vulnerabilities, particularly jailbreak attacks—where adversarial prompts bypass safety filters to elicit harmful outputs.
While several jailbreak benchmarks have been proposed, they fall short in capturing the compositional nature of real-world attacks, limiting the breadth and diversity of the attack space they can explore.
In this work, we present Jailbreak LEGO, a novel benchmarking framework that systematically extracts fine-grained, atomic strategy components from existing jailbreak attacks, with standardized interfaces for modular composition.
We formalize jailbreak prompts as structured triples and categorize extracted components into three functional types based on their transformation behavior. This design allows components to function like LEGO blocks—plug-and-play units that can be flexibly composed to reconstruct existing attacks or synthesize novel ones. Our benchmark encompasses 16 advanced jailbreak methods, 8 widely used LLMs, and a library of 26 reusable strategy components. Experimental results demonstrate that compositional attacks produced by Jailbreak LEGO not only replicate prior methods but also uncover a large number of previously unseen vulnerabilities (e.g., achieving up to a 91\% success rate on Claude-3.7). Jailbreak LEGO establishes a new standard for systematic red-teaming of LLMs.
Code is available at https://anonymous.4open.science/r/Jailbreak-LEGO-4CCD.
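To make the composition idea concrete, the following is a minimal sketch, not the authors' actual API, of how jailbreak prompts could be represented as structured triples and how atomic strategy components might be chained LEGO-style. All class, function, and field names here are hypothetical illustrations, since the abstract does not specify the framework's interfaces or the names of the three component types.

```python
# Hypothetical sketch of triple-structured prompts and composable strategy components.
# Names and fields are illustrative assumptions, not the Jailbreak LEGO implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class JailbreakPrompt:
    """Structured triple: the underlying intent, a contextual scenario, and the payload text."""
    intent: str      # the request being pursued
    scenario: str    # contextual framing (e.g., role-play, nested story)
    payload: str     # the concrete prompt string sent to the target LLM

# A strategy component is a plug-and-play transformation over the triple.
Component = Callable[[JailbreakPrompt], JailbreakPrompt]

def roleplay_wrapper(p: JailbreakPrompt) -> JailbreakPrompt:
    """Example scenario-transforming component: wraps the prompt in a fictional framing."""
    scenario = "You are an actor rehearsing a scene."
    return JailbreakPrompt(p.intent, scenario, f"{scenario}\n{p.payload}")

def payload_rewriter(p: JailbreakPrompt) -> JailbreakPrompt:
    """Example payload-transforming component: lightly rephrases the request (trivial stand-in)."""
    return JailbreakPrompt(p.intent, p.scenario, p.payload.replace("Explain", "Describe"))

def compose(components: List[Component]) -> Component:
    """Chain components into a single compositional attack pipeline."""
    def pipeline(p: JailbreakPrompt) -> JailbreakPrompt:
        for component in components:
            p = component(p)
        return p
    return pipeline

# Reordering or swapping blocks yields either a reconstruction of an existing
# attack or a new compositional variant.
attack = compose([payload_rewriter, roleplay_wrapper])
```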
Primary Area: datasets and benchmarks
Submission Number: 16228