Boundary Point Jailbreaking of Black-Box LLMs
Keywords: Jailbreaking, LLM Security, Adversarial Attacks
TL;DR: Boundary Point Jailbreaking combines curriculum learning and high-signal boundary points to make fully black-box jailbreak optimisation practical against state-of-the-art classifier defences.
Abstract: Frontier LLMs are commonly safeguarded against attempts to extract harmful information via adversarial prompts known as "jailbreaks". Expert and automated red teaming is the de facto measure of safeguard efficacy. Recently, defenders have developed classifier-based systems that have survived thousands of hours of such red teaming. In this work, we introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength ("boundary points"). To the best of our knowledge, BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers. BPJ attacks are difficult to defend against in individual interactions but incur a relatively large number of flags during optimisation; these results suggest that effective defence requires supplementing single-interaction methods with batch-level monitoring.
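The intuition behind "boundary points" can be illustrated with a purely numerical toy model. The sketch below is not the authors' algorithm: it assumes a hypothetical logistic flag probability over abstract "difficulty" and "strength" scores, and every name in it (`flag_probability`, `signal`, `select_boundary_points`, `sharpness`) is invented for illustration. It only shows why, with one bit of feedback per query, evaluation points whose flag probability sits near 0.5 are the most sensitive to a small change in strength.

```python
import math


def flag_probability(difficulty: float, strength: float, sharpness: float = 4.0) -> float:
    """Toy assumption: logistic chance that an evaluation point is flagged."""
    return 1.0 / (1.0 + math.exp(-sharpness * (difficulty - strength)))


def signal(difficulty: float, strength: float, delta: float = 0.1) -> float:
    """Drop in flag probability when strength rises by a small delta."""
    before = flag_probability(difficulty, strength)
    after = flag_probability(difficulty, strength + delta)
    return before - after


def select_boundary_points(difficulties, strength, k=3):
    """Keep the k points whose flag probability is closest to 0.5."""
    return sorted(
        difficulties,
        key=lambda d: abs(flag_probability(d, strength) - 0.5),
    )[:k]


# Hypothetical curriculum of evaluation points, from easy (0.0) to hard (2.0).
curriculum = [i / 10 for i in range(21)]
strength = 1.0

boundary = select_boundary_points(curriculum, strength)
best_signal_point = max(curriculum, key=lambda d: signal(d, strength))

print("boundary points:", sorted(boundary))
print("most informative point:", best_signal_point)
print("signal far from boundary:", signal(0.0, strength))
print("signal at boundary:", signal(1.0, strength))
```

In this toy setting, points far from the boundary are almost always flagged or almost never flagged, so a single-bit response barely changes when strength improves slightly; points near the boundary shift the most, which is the property the abstract attributes to boundary points.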
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 103